Assignment 1: About the project

Here’s the link to the repo on github: https://github.com/oj-lappi/IODS-project

My background and motivations

I’d like to learn about statistical tests and possibly how to add R to data pipelines.

I don’t think R will replace my current tools for analyzing and plotting data, but I do want to at least see how feasible it would be to add R as a tool for developing statistical analyses for experiments, and possibly writing some small programs for quick statistical analysis in postprocessing of experiments.

I’m familiar with git, it’s a daily driver for me, I tend to use gnuplot and python for quick plotting, and either use python, UNIX tools or custom C++ programs for data analysis utilities.

Goals

I’m interested in creating automatic data pipelines that process data in object storage and upload plots to a dashboard. R+Github pages might be a good way to do that, although I would prefer an on-prem solution (HU’s gitlab maybe?) or a self-hosted website.

RStudio/Git integration

I have ssh keys set up on each of my computers, and a personal access token would be less flexible than that.

I can push and pull with this setup from RStudio.

Exercises

I checked the exercises and followed some of them along locally, I think I’m getting the hang of it :).

# Timestamp:

date()
## [1] "Thu Nov 30 16:15:51 2023"

Assignment 2: Linear regression on the Learning2014 dataset

1. Reading the data

#Dependencies
#install.packages(c("readr","lmtest", "dplyr"))
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
# Timestamp
date()
## [1] "Thu Nov 30 16:15:51 2023"

2. Analysis

lrn2014 <- read_csv('data/learning2014.csv')
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.1 The shape of the data

## [1] "dimensions: 166" "dimensions: 7"

There are 166 rows with 7 columns each.

The spec is:

## cols(
##   gender = col_character(),
##   age = col_double(),
##   attitude = col_double(),
##   deep = col_double(),
##   stra = col_double(),
##   surf = col_double(),
##   points = col_double()
## )

2.2 Descriptive statistics

Sex distribution

gender seems to be a categorical value, so let’s see the number of rows per gender (sex):

## # A tibble: 2 × 2
##   gender count
##   <chr>  <int>
## 1 F        110
## 2 M         56

110 F, and 56 M, there is a skew towards female students in the dataset. Let’s plot that.

## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
fig. 1.1, sex distributions

fig. 1.1, sex distributions

Age distribution

Let’s plot the age distribution of the students.

lrn2014 %>%
  ggplot(aes(x = age)) +
  stat_count()
fig. 1.2, age distributions

fig. 1.2, age distributions

min(lrn2014$age)
## [1] 17
max(lrn2014$age)
## [1] 55
median(lrn2014$age)
## [1] 22

The age range is from 17 to 55, and the median is 22. Visually inspecting the distribution, the mode of the distribution is early twenties, as you would expect, although there is a long tail.

Age-sex distribution

Let’s combine the two columns into a classic population pyramid, or age-sex pyramid.

But that’s not exactly what we want. It turns out a population pyramid is not an out-of-the-box plot we can easily produce, we have to manually calculate bins and do some magic.

lrn2014 %>%
  mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
  group_by(gender, age_bin) %>%                      # Group by bin and gender
  summarize(count =n()) %>%                          # Sum over groups
  mutate(count = 
      if_else(gender == 'M', -count, count)) %>%     # Turn one negative
  ggplot(aes(x=count, y = age_bin)) +
  geom_col(aes(fill=gender)) 
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
fig. 1.3a, population pyramid

fig. 1.3a, population pyramid

There are very few male students under 20, I would speculate that this is due to Finnish army conscription, otherwise the distribution seems roughly equal on the female and male sides.

We can of course bin by year instead of 5 years, and we get a higher resolution but more noise.

## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
fig. 1.3b, population pyramid #2, fine-grained bins

fig. 1.3b, population pyramid #2, fine-grained bins

There is one peculiar decline in the female student participation around ~26-28 which jumps back after thirty. This might be a maternity effect, but this is highly speculative, there’s very few samples in this dataset.

Exam scores

Let’s look at exam scores:

paste("median:", median(lrn2014$points), ", mean:",mean(lrn2014$points), ", standard deviation:", sd(lrn2014$points))
## [1] "median: 23 , mean: 22.7168674698795 , standard deviation: 5.89488364245529"

Let’s look at the full distribution usin geom_density.

lrn2014 %>%
  ggplot(aes(x=points)) +
  geom_density()
fig. 1.4a, exam score distribution

fig. 1.4a, exam score distribution

There’s a curious valley in the density at around 12-14 points. Let’s look closer.

lrn2014 %>%
  ggplot(aes(x=points, tickwidth=1)) +
  geom_histogram(boundary=0,binwidth=1) +
  scale_x_continuous(breaks = seq(0, 40, by = 1))
fig. 1.4b, exam score histogram

fig. 1.4b, exam score histogram

So no students got 12,13, or 14 points. The jump from 11 to 15 must be behind some barrier. Let’s look at our two demographic variables.

Exploring exam score distributions for different groups

Point means by sex, with age gradient
lrn2014 %>%
  group_by(gender) %>%
  ggplot(aes(x=points)) +
  geom_histogram(boundary=0,binwidth=2, aes(fill=factor(age))) +
  facet_wrap(~gender)
fig. 1.5a, exam scores by sex, with age gradient

fig. 1.5a, exam scores by sex, with age gradient

I see no clear bias either way, perhaps a slight correlation with score and age within the mode (20-30 years) and then no correlation for higher ages. The female distribution seems most well-behaved, although the gap from 11 points up is much sharper here as well.

Point means by age group, with sex
lrn2014 %>%
  mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
  group_by(gender, age_bin) %>%
  ggplot(aes(x=points)) +
  geom_histogram(boundary=0,binwidth=2, aes(fill=gender)) +
  facet_wrap(~age_bin)
fig. 1.5b, exam scores by age group, with labeled sex

fig. 1.5b, exam scores by age group, with labeled sex

Point means by age group, box plot
lrn2014 %>%
  mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
  ggplot(aes(x=age_bin, y = points)) +
  stat_boxplot(geom = "errorbar", width = 0.25) +
      geom_boxplot()
fig. 1.6a, exam score distributions by age group, box plot sequence

fig. 1.6a, exam score distributions by age group, box plot sequence

lrn2014 %>%
  mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
  ggplot(aes(x=age_bin, y = points)) +
  stat_boxplot(geom = "errorbar", width = 0.25) +
      geom_boxplot()+
  facet_wrap(~gender)
fig. 1.6b, exam score distributions by age-sex group, box plot sequence

fig. 1.6b, exam score distributions by age-sex group, box plot sequence

I see no correlation between age and exam score in these plots.

2.3 Correlation matrix

Finally, let us look at all the survey question scores together with the already explored variables in a correlation matrix.

lrn2014 %>%
  select(gender, age, surf, stra, deep, attitude, points) %>%
  ggpairs(aes(fill=gender, color=gender),lower=list(combo=wrap("facethist",binwidth=0.5))) # Have to add the lower arg so ggpairs doesn't complain
fig. 1.7, correlation matrix of all variables

fig. 1.7, correlation matrix of all variables

There seem to be negative correlations between: - surf and deep (but seemingly only strongly for male students) - attitude and deep (weak, but stronger for male students again) - stra and surf (weak)

And a possitive correlation between points and attitude! This is the strongest linear relationship in the data. And we can verify that age does not seem to have an effect at all.

There also seems to be a relationship between attitude and gender, but no relationship between attitude and points.

3. Regression

Based on the data exploration, attitude seems the most likely candidate, and next after that: stra and surf.

3.1 Simple regression as baseline

Let’s start with a simple regression model of points ~ attitude, which will be our baseline.

attitude_model <- lrn2014 %>%
                  lm(points ~ attitude, data = .)
summary(attitude_model)
## 
## Call:
## lm(formula = points ~ attitude, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

There is clearly a statistically significant relationship between attitude and points. But R-squared is only around 18.5%, so there is a lot of variance not explained by the model.

3.2 Multiple regression, 3 variables

three_var_model <- lrn2014 %>%
                  lm(points ~ attitude + stra + surf, data = .)
summary(three_var_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The adjusted R-squared is higher, which means our model is capturing more of the underlying interactions than before, although still below 20%.

It seems that the relationship between points and surf is not statistically significant.

Let’s drop surf and keep stra, and try again.

3.3 Multiple regression, 2 variables

two_var_model <- lrn2014 %>%
                  lm(points ~ attitude + stra, data = .)
summary(two_var_model)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Right, this is our best fit yet, based on the adjusted R-squared. Depending on what our choice of significance level would be in a hypothesis test, the interaction with stra would be ignored. At a standard level of a = 0.05, we wouldn’t reject the null hypothesis that stra has a linear effect on points.

Let’s test another model, with nothing but stra.

3.4 Simple regression with only stra

stra_model <- lrn2014 %>%
                  lm(points ~ stra, data = .)
summary(stra_model)
## 
## Call:
## lm(formula = points ~ stra, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.5581  -3.8198   0.1042   4.3024  10.1394 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   19.233      1.897  10.141   <2e-16 ***
## stra           1.116      0.590   1.892   0.0603 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.849 on 164 degrees of freedom
## Multiple R-squared:  0.02135,    Adjusted R-squared:  0.01538 
## F-statistic: 3.578 on 1 and 164 DF,  p-value: 0.06031

We get very close to a statistically significant result at a standard significance level of 0.05, but not quite. Let’s drop stra, since that’s what the assignment asked us to do.

The model we will be using is therefore the baseline model with one predictor: attitude.

4. Interpretation

4.1 Summary

Let’s rerun that summary:

summary(attitude_model)
## 
## Call:
## lm(formula = points ~ attitude, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The fitted regression coefficients are: - intercept: 11.6372 - attitude: 3.5255

Which means the that the model predicts the conditional mean of the exam scores, given an attitude value as:

points = 11.6372 + 3.5255*attitude

If we multiply the attitude coefficient with the range of the attitude in the population, we can get an idea of how the model assigns expected exam scores based on a students attitude.

am <- mean(lrn2014$attitude)
as <- sd(lrn2014$attitude)
print("Range of predictor term within a sd:")
## [1] "Range of predictor term within a sd:"
3.5255*c(am-as, am, am+as)
## [1]  8.506549 11.079839 13.653130
print("Range of ŷ_i within a sd:")
## [1] "Range of ŷ_i within a sd:"
11.6732 + 3.5255*c(am-as, am, am+as)
## [1] 20.17975 22.75304 25.32633
print("Range of ŷ_i within two sd's:")
## [1] "Range of ŷ_i within two sd's:"
11.6732 + 3.5255*c(am-2*as, am, am+2*as)
## [1] 17.60646 22.75304 27.89962
spec(lrn2014)
## cols(
##   gender = col_character(),
##   age = col_double(),
##   attitude = col_double(),
##   deep = col_double(),
##   stra = col_double(),
##   surf = col_double(),
##   points = col_double()
## )

So, assuming attitude is Gaussian and that the sample stddev is a good estimate, the model assigns: an exam score: - between 20.17975 and 25.32633 to a majority of students (about 68%, one stddev in a Gaussian captures 34.1% of the population) - between 17.60646 and 27.89962 to a super-majority (95%) of students

If we look back at the exam score distribution in figure 1.4, this does capture the mode of the distribution.

3.2 Multiple R-squared

The multiple R-squared value is 0.1906, the standard way to express this is that “19% of the variation in exam scores is explained by the attitude variable” (see MABS4IODS, 3.2.1).

I would interpret this to mean that, using the linear model, given attitude we estimate the expectation of the standard error (squared) of this prediction to be roughly 80% of what it would be when simply using the sample mean as a predictor. The estimation assumes the sample is representative, because we’re using residuals to get this number.

I’m not quite sure if this more elaborate interpretation is exact, but it’s what I was able to piece together from sources online, mostly wikipedia (https://en.wikipedia.org/wiki/Coefficient_of_determination).

5. Regression diagnostics

par(mfrow=c(2,2))
plot(attitude_model)
2.0 Regression diagnostics

2.0 Regression diagnostics

par(mfrow=c(1,1))

5.1 Assumptions

The assumptions of the model are that: 1. the constant variance assumption (homoscedasticity) 2. the normality assumption: 3. there’s a linear relationship between the predictor and the response

The Residuals vs. Leverage

The Residuals vs. Leverage plot checks that there aren’t any influential outliers that are affecting the fit of the regression coefficients. The plot has a dashed line showing a critical Cooks’s distance value. In our case this dashed line is not visible. Essentially, if a point is very far right and has a high standardized residual (off the central line), it’s an higly influential point and will have to be looked into.

A highly influential point may be an indication of a point that should be excluded, but this has to be done on a case-by-case basis. It might also mean that assumption 3 is violated, that there isn’t a linear relationship between predictors and the response.

The Residuals vs. Fitted plot

The residuals vs fitted plot (and the scale-location plot) gives us a visual way to check for heteroscedasticity (violation of assumption 1). If the red mean-line is not horizontal, it means the residuals have a bias in some region of the response distribution (the vairance is not constant).

I don’t think this means the data is heteroscedastic, it certainly doesn’t look like it is. But I’m not so familiar with these visual checks, so I searched the web for a homoscedasticity test, and found the Breusch-Pagan Test in the lmtest library.

attitude_model_bptest <- bptest(attitude_model)
attitude_model_bptest
## 
##  studentized Breusch-Pagan test
## 
## data:  attitude_model
## BP = 0.0039163, df = 1, p-value = 0.9501

A p-value of 0.95 means that there is definitely very little evidence that the model is heteroscedastic. Ok, good! The trend has to be much clearer than that then.

Normal QQ-plot

This plot compares the “ideal” quantiles to the sample quantiles. This is used to test for normalcy by comparing the residual distribution against a theoretical normal distribution.

There are some outliers at the bottom left, which may indicate a bimodal distribution (remember that the exam scores looked bimodal as well, fig 1.4). The plot also curves down a little at the upper end, which I believe usually indicates a light left skew.

let’s test the residuals for normalcy, using the Shapiro-Wilk test:

shapiro.test(attitude_model[["residuals"]])
## 
##  Shapiro-Wilk normality test
## 
## data:  attitude_model[["residuals"]]
## W = 0.97394, p-value = 0.003195

The p-value is quite small, 0.003, so we to reject the null hypothesis that the residuals are normally distributed.

Hmm, maybe the issue is the grading scale? I’ve tried to fix this, and I can ge the QQ-plot to look nicer with some transformations, but it makes the model test statistics worse. This is the best I can do for now.


Assignment 3: Logistic regression

1. Get the data

library(dplyr)
library(readr)
data_url <- "https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv"
df <- read_csv(data_url)
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl  (1): high_use
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Summary of variables

The variables in the data set are:

colnames(df)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The dataset contains student school performance data averaged over two courses: portuguese and mathematics (the variables are: G1, G2, G3, absences, failures, paid). To be exact, the “paid” column is only from one of the courses. The other variables are demographic and social, the social variables were collected using surveys.

Two variables: alc_use and high_use are transformations of the original dataset added for this assignment. alc_use is the average alcohol use per day, combining self-reported weekday and weekend usage on a scale of 1 to 5. high_use is a boolean indicating whether the alc_use variable is more than 2.

3. Choose 4 interesting variables in the data to study in relation to alcohol use

I choose:

  • Fedu: Father’s education level
  • Medu: Mother’s education level
  • absences: number of absences on average per course
  • G3: final course grade average

My hypotheses is:

  • parents education levels (Fedu,Medu) has a small but significant impact on the likelihood of the high_use variable.
  • increase in absences incrreases likelihood of high alcohol usage
  • decrease in grades increases likelihood of high alcohol usage

4. Explore variables

Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses. (0-5 points)
library(ggplot2)
library(GGally)
df %>% select(alc_use, high_use, Fedu, Medu, absences, G3) %>%
  ggpairs(aes(fill=high_use, color=high_use))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
fig. 1.1, correlation matrix of variables, colored by high_use value

fig. 1.1, correlation matrix of variables, colored by high_use value

  • It turns out that the dataset shows absolutely no correlation between parent’s education and alcohol usage.
  • Fedu is very highly correlated with Medu though, which is not surprising but interesting to note, people seem to marry within their social class.
  • absences has a very high correlation with alcohol use
  • grades have a slightly smaller negative correlation with alcohol use

Two out of four hypotheses seem to have some support in the data after this exploration.

5. Logistic regression

Let’s do a logistic regression model using the two statistically significant correlations: G3 and absences.

model <- glm(high_use ~ G3 + absences, data = df, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ G3 + absences, family = "binomial", 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3286  -0.8298  -0.7219   1.2113   1.9242  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.38732    0.43500  -0.890 0.373259    
## G3          -0.07606    0.03553  -2.141 0.032311 *  
## absences     0.08423    0.02309   3.648 0.000264 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 429.93  on 367  degrees of freedom
## AIC: 435.93
## 
## Number of Fisher Scoring iterations: 4

Both G3 and absences have a p-value of less than 0.5 for the Wald tests of the fit, but we have a better test that’s easier to interpret, the confidence intervals from the model.

5.1 Odds ratio

coeffs <- coefficients(model)
odds_ratio <- coeffs %>% exp
odds_ratio
## (Intercept)          G3    absences 
##   0.6788751   0.9267616   1.0878762

The odds ratios for G3 and absences are roughly 0.926 and 1.088 For G3, the OR is less than one because the correlation between the variables is negative.

The way to interpret these is that: - for each decrease in final grade average, the odds increase by roughly 8.0% that a student has high alcohol use (1/0.9267 ~= 1.08) - for each increase in absences per course, the odds increase by roughly 8.7% that a student has high alcohol use

But that’s just the average effect the model has fit, let’s look at confidence intervals in the odds ratio

5.2 Confidence intervals

ci <- confint(model) %>% exp
## Waiting for profiling to be done...
ci
##                 2.5 %    97.5 %
## (Intercept) 0.2857593 1.5832545
## G3          0.8639686 0.9935498
## absences    1.0419618 1.1407755

The 95% confidence intervals are both on one side of unity, so we can say with 95% certainty that there is an effect for both variables, and the effect increases the odds by: - a factor in the range of [1.007, 1.15] for each decrease in final grade average (again,inverting the odds ratio because the effect is negative) - a factor in the range of [1.04, 1.14] for each increase in absences per course

5.3 Interpretation

There is an effect, although the final grade average effect goes damn near 1 at the low end of the confidence interval, the unit increase in odds is only 0.7%! For absences, the confidence interval is a little tighter, but it still seems a little wide to me for practical use. We would need more samples in order to get a tighter interval.

The two hypotheses that survived the initial exploration both match the outcomes of the logistic regression, and now we can quantify the quality of an estimate of the effect.

6. Predictions

Let’s test out the model with a confusion matrix.

probabilities <- predict(model, type = "response")
df <- df %>% 
        mutate(probability = probabilities) %>%
        mutate(prediction = probability > 0.5)

table(high_use = df$high_use, prediction = df$prediction) %>%
        prop.table %>%
        addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.68108108 0.01891892 0.70000000
##    TRUE  0.25675676 0.04324324 0.30000000
##    Sum   0.93783784 0.06216216 1.00000000

The confusion matrix tells us that the model has a false positive rate of ~2% and a false negative rate of ~26%. That’s pretty good! High false negative rates are not so bad, they are just missed cases. High false positives would mean that the model is unreliable and cannot be used as an indicator (in this case. The importance of different error types depend on the specific use case and the meaning of negative and positive).

On the other hand, the confusion matrix also tells us that the model preficts 94% of all students to have non-high alcohol use, while in reality the number is 70%. So the model achieves this relative safe indicator status by being rather conservative.

Caveat

We haven’t done a split into a model fit dataset and a validation set, so this confusion matrix is of limited value.


Assignment 4: Classification and Clustering

1. Overview

This assignment is about classification and clustering. We’re looking at a dataset put together from sensus data, and seeing how crime rate varies across multiple correlated variables.

The methods used are linear discriminant analysis (LDA) and k-means clustering.

2. The dataset

Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here (https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html). (0-1 points)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

Dataset structure

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

There are 506 rows with 14 columns. The dataset seems to originally have been put together to analyze housing values across the suburbs. From the paper cited on the stat.ethz.ch site: “Harrison, D. and Rubinfeld, D.L. (1978) Hedonic housing prices and the demand for clean air”, we can find that a row represents a “SMSA census tract”, so, a census area in Boston containing some number of housing units.

The columns contain social statistics related to these census areas (e.g. crim = crime rate, ptratio = pupil-teacher ratio), data about the housing units in the area (rm = avg # of rooms per unit, medv = median housing unit value, age = prop. houses built before 1940), and data about the location of the area (e.g. dis = weighted mean of distances to employment centers, chas = 1 if by Charles River, 0 if not by Charles River).

Some of the columns are a little counter-intuitive or difficult to interpret. E.g., the column age is the proportion of houses built before 1940, and the column lstat is the proportion of the population that is lower status. From the Harrison & Rubinfield paper, lower status means: “1/2 * (proportion of adults without some hig h school education and proportion of male workers classified as laborers)”.


Ok, before we move forward, we did see a small issue here, let’s change chas from an integer to a boolean.

library(dplyr)
Boston_explore <- Boston %>% mutate(chas = chas == 1)
str(Boston_explore)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Summary

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)
summary(Boston_explore)
##       crim                zn             indus          chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Mode :logical  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   FALSE:471      
##  Median : 0.25651   Median :  0.00   Median : 9.69   TRUE :35       
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14                  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10                  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74                  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Looking at this summary, all columns are definitely not created equal. I am mostly looking at the difference between the mean and median, which — if they differ by much — can indicate that a variable is not normally distributed. Some of the columns are close to normal distributions, but e.g. zn has a median of 0 and a mean of 11.36. Other highly skewed columns are rad, crim, tax, chas, and black.

3. Graphical summary

With ggpairs

library(ggplot2)
library(GGally)
Boston_explore %>%
  ggpairs(lower=list(combo=wrap("facethist",binwidth=0.5))) # Have to add the lower arg so ggpairs doesn't complain
fig. 3.1, Correlation matrix

fig. 3.1, Correlation matrix

Ok, lot’s to unpack here, let’s start with the a visual check of each variable’s distribution (the diagonal). Almost none of the columns look normally distributed, with perhaps the exception of rm, the number of rooms.

There are lots of interesting correlations, just looking at the scatter plots, the three values rm, medv, and lstat seem to have strikingly strong relationships with each other with medv, which makes sense to me.

The rad variable, which is an “index of accessibility to radial highways, is clearly a bimodal, or one could even say a split distribution. A subset of areas have a much higher index than the others, and in the scatter plots, this clearly visible line of that higher-index population seems to consistently cover different ranges of the other variable than the lower-index population. The effect is most clearly noticeable in the crim, tax, nox, dis and black scatter plots.

dis and nox also have a strikingly logarithmic-looking relationship.

In general, nearly every variable seems to be correlated with every other variable, excepting the chas (area is by the Charles river) column.

With corrplot

library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(Boston_explore), method="circle")
fig. 3.2, Correlation matrix with corrplot

fig. 3.2, Correlation matrix with corrplot

Using corrplot, we lose some information, but get a better overview of where the correlations are strongest.

We see strong correlations (large balls) between: - dis and (zn, indus, nox, and age) - tax and rad, and this is a very strong correltaion, they seem to capture much of the same variation within them - tax and (crim, indus, and nox) - ditto for rad - lstat and medv

4. Standardize and summarize

Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. (0-2 points)

Let’s run the scale function on the dataset.

Boston_scaled <- as.data.frame(scale(Boston_explore))
summary(Boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

We’ve now normalized the columns by subtracting the mean and dividing by the standard deviation such that, if they were normally distributed, they would now be standard normally distributed.

Create a categorical crime rate column

# Create quartile class
Boston_scaled$crim <- as.numeric(Boston_scaled$crim)
Boston_scaled$crime <- cut(Boston_scaled$crim, breaks = quantile(Boston_scaled$crim), include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))

# Drop crim
Boston_scaled <- dplyr::select(Boston_scaled, -crim)

We’ve split the crime rate column into a categorical variable defining in which quartile of the crime rate distribution the sensus area is in.

Create training and test datasets

set.seed(179693716) 
n <- nrow(Boston_scaled)
split <- sample(n,  size = n * 0.8)

train <- Boston_scaled[split,]
test <- Boston_scaled[-split,]

Now we’ve split the dataset into two: 80% of rows are in the training set, 20% are in the test set.

5. Linear discriminant analysis

Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot. (0-3 points)
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2475248 0.2351485 0.2623762 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       0.85485581 -0.8956178 -0.08120770 -0.8436157  0.4413548 -0.8603410
## med_low  -0.09212737 -0.2763238  0.04263895 -0.5633101 -0.1877614 -0.3785908
## med_high -0.38704309  0.1791469  0.26643202  0.3710782  0.1054813  0.4474274
## high     -0.48724019  1.0170298 -0.04947434  1.0820880 -0.4230980  0.8279971
##                 dis        rad        tax     ptratio      black       lstat
## low       0.8290333 -0.6897369 -0.7642825 -0.43374243  0.3887547 -0.77819536
## med_low   0.3902633 -0.5443053 -0.4238660 -0.08385135  0.3219258 -0.11887074
## med_high -0.3933098 -0.4088466 -0.3025121 -0.20555126  0.1295279  0.06516648
## high     -0.8564037  1.6390172  1.5146914  0.78181164 -0.8395683  0.90684062
##                 medv
## low       0.52171440
## med_low  -0.04934231
## med_high  0.14052867
## high     -0.73164690
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.12991855  0.717286290 -1.11409479
## indus    0.01318562 -0.275547338  0.12479393
## chas    -0.08510193 -0.049959856  0.17108503
## nox      0.50033989 -0.751262885 -1.27531579
## rm      -0.10015731 -0.108297341 -0.19190232
## age      0.21057791 -0.358696692 -0.14782657
## dis     -0.05526417 -0.347261139  0.38293342
## rad      3.16593191  1.032967549 -0.36643042
## tax     -0.01168751 -0.096843332  1.10072573
## ptratio  0.12524469  0.004459516 -0.41009220
## black   -0.12747580 -0.014860435  0.07573253
## lstat    0.22137355 -0.316638242  0.31488347
## medv     0.17976929 -0.444140572 -0.22238434
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9509 0.0340 0.0151

Biplot

arrows <- function(x, scale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = scale * heads[,choices[1]], 
         y1 = scale * heads[,choices[2]], col=color, length = arrow_heads)
  text(scale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)

plot(lda.fit, dimen = 2, col=classes, pch=classes)
arrows(lda.fit, scale = 1.5, color = "#ee8855")
fig. 5.1 LDA biplot

fig. 5.1 LDA biplot

We see that out of the two first linear discriminants, LD1 nearly perfectly separates the data into two clusters: those with high crime rate, and those with other values. rad has the clearly highest coefficient in LD1, which can be seen both from the biplot and the LDA summary.

LD2 seems to find another axis within the data that explains a smaller effect. The largest coefficients in LD2 belong to nox, medv, and zn.

6. Validation in test set

# Drop the result variables
facit <- test$crime
test <- dplyr::select(test, -crime)

Let’s predict the crime rate quartiles in the test set and cross tabulate:

# Predict classes in the test data
lda.pred <- predict(lda.fit, newdata = test)

# Do a confusion matrix
tab <- table(correct = facit, predicted = lda.pred$class)
tab
##           predicted
## correct    low med_low med_high high
##   low       15       9        0    0
##   med_low    5      13        8    0
##   med_high   1      10       19    1
##   high       0       0        0   21
nrow(test)
## [1] 102

and here’s the same table as a confusion matrix:

image(tab)
fig. 6.1 Confusion matrix of the LDA fit

fig. 6.1 Confusion matrix of the LDA fit

The confusion matrix shows that the model has found an axis that aligns very well with the crime rate quartile category. Most predictions are correct (68/102), the second most common case is being off by one class (33/102) and then off by two classes (1/102).

68/102 ~= 67% is not perfect but it is a lot better than choosing by random which should give us a correct prediction 25% of the time. It looks like the model can be used to make a decent predictor for whether an area has a high or non-high crime rate (the lower left of the matrix vs. the top right), for that predictor, we would have a correct classification rate of 101/102, nearly 100%!

6.Extra: let’s try with only rad

lda.radfit <- lda(crime ~ rad, data = train)
lda.radpred <- predict(lda.radfit, newdata = test)
radtab <- table(correct = facit, predicted = lda.radpred$class)
image(radtab)
fig. 6.2 Confusion matrix with a single variable LDA fit

fig. 6.2 Confusion matrix with a single variable LDA fit

Using only rad gives us a model that seems to have exactly the same predictive power in the high vs not high case, but loses information in the lower quartiles. This fits with our earlier analysis of how LD1 was mostly rad and was able to carve out most of the high crime rate areas from the rest.

7. K-means

The analysis so far suggests there are at least two clear clusters in the data, so we could just choose k = 2, but let’s check with the total within cluster sum of squares what a good choice for k would be.

I will take five samples and average them, and plot the standard deviation of the twcss for each k as error bars, this should give us a more reliable plot than just taking one sample.

# determine the number of clusters
#k_max <- 10

# calculate the total within sum of squares, take 5 samples to stabilize the variance 
twcss1 <- sapply(1:10, function(k){set.seed(100); kmeans(Boston, k)$tot.withinss})
twcss2 <- sapply(1:10, function(k){set.seed(123); kmeans(Boston, k)$tot.withinss})
twcss3 <- sapply(1:10, function(k){set.seed(321); kmeans(Boston, k)$tot.withinss})
twcss4 <- sapply(1:10, function(k){set.seed(130); kmeans(Boston, k)$tot.withinss})
twcss5 <- sapply(1:10, function(k){set.seed(949); kmeans(Boston, k)$tot.withinss})

df <- as.data.frame(tibble(twcss1,twcss2,twcss3,twcss4,twcss5, k= seq(1,10)))

df <- df %>% rowwise() %>% mutate(twcss = mean(c(twcss1,twcss2,twcss3,twcss4,twcss5)))
df <- df %>% rowwise() %>% mutate(twcss_var = var(c(twcss1,twcss2,twcss3,twcss4,twcss5)))
df %>% ggplot(aes(x=k, y=twcss)) +
  geom_line() +
  geom_errorbar(aes(ymin=twcss-sqrt(twcss_var),ymax=twcss+sqrt(twcss_var),color="red"))+
  theme(legend.position="none") +
  scale_x_continuous(breaks=df$k) +
  scale_y_continuous(breaks=seq(1000000,10000000,2000000))
fig. 7.1 k-means twcss plot

fig. 7.1 k-means twcss plot

It does look like the plot agrees that k = 2 gives a very good fit. The k-means algorithm seems to always find the same clusters here, because the error bars are attached to each other, indicating that the twcss measure is constant here.

k=3 is also a potential choice, although less clear to me. After that, however, the variance increases greatly and the twcss delta starts giving minimal returns, which indicates that there isn’t a clear structure to the data which would guide the clustering.

I will go with k=2, as that seems to match what we saw in the data earlier, and the clusters are very stable.

Boston_kmeans <- Boston %>% scale %>% as.data.frame
set.seed(101)
# k-means clustering
k2m <- kmeans(Boston_kmeans, centers = 2)
summary(k2m)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       28    -none- numeric
## totss          1    -none- numeric
## withinss       2    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           2    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
k2m$centers
##         crim         zn      indus         chas        nox         rm
## 1 -0.3894158  0.2621323 -0.6146857  0.002908943 -0.5823397  0.2446705
## 2  0.7238295 -0.4872402  1.1425514 -0.005407018  1.0824279 -0.4547830
##          age        dis        rad        tax    ptratio      black      lstat
## 1 -0.4331555  0.4540421 -0.5828749 -0.6291043 -0.2943707  0.3282754 -0.4530491
## 2  0.8051309 -0.8439539  1.0834228  1.1693521  0.5471636 -0.6101842  0.8421083
##         medv
## 1  0.3532917
## 2 -0.6566834

It seems that even this k=2 means clustering has found the high-crime rate cluster. Let’s confirm that with a visualization.

ggpairs(Boston_kmeans[c("crim", "rad", "tax", "black", "age", "medv", "dis", "zn", "rm", "lstat")], aes(color=factor(k2m$cluster), fill=factor(k2m$cluster)), lower=list(combo=wrap("facethist",binwidth=0.5)))
## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero
fig. 7.1 k-means pairs

fig. 7.1 k-means pairs

It seems that the model has picked two clusters that have the following relative position to each other:

  • the red cluster has a much lower crime rate
  • the red cluster has a lower radial highway access index
  • the red cluster has a lower tax rate
  • the red cluster has a much higher proportion of black residents
  • the red cluster has a lower proportion of buildings built prior to 1940
  • the red cluster has a similar median value distribution, shifted towards a higher evaluation (excepting a blue bump right at the top of the evaluation range)
  • the blue cluster has a much smaller distance to employment centers
  • the blue cluster has a much smaller proportion of residential land zoned for large plots
  • the red cluster has a much smaller proportion of single room apartments and other small housing units
  • the red cluster has a smaller proportion of working class people and non-educated adults

So it seems we have found a blue cluster with a lot of business activity (high tax rate, close to employment centers), with access to arterial highways, and high density building (lower number of rooms, zoned for smaller plots) and a red cluster of areas with less business activity, in relatively quiet regions with longer commutes and more working class or non-educated people, and a much higher proportion of black residents.

So it seems like the red regions are new developments, new suburbs farther away from the city, and there may be some price discrimination in the house prices (medv) connected with the high proportion of black residents living there. You can see effects of segregationist US housing policy in the data. E.g., the 1949 housing act set up a framework to subsidize public housing for whites with clauses forbidding resale to black people, which means that black people paid more for housing (see e.g. “Abramovitz & Smith, The Persistence of Residential Segregation by Race, 1940 to 2010: The Role of Federal Housing Policy,Families in Society, Vol. 102, Issue 1” for more).

7.2 Let’s try with k=3

Why not try with k=3 for good measure? Maybe we can find additional structure in the data.

set.seed(124293)
k3m <- kmeans(Boston_kmeans, centers = 3)
summary(k3m)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       42    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
k3m$centers
##         crim         zn      indus        chas        nox         rm        age
## 1 -0.4076669  1.1526549 -0.9877755 -0.10115080 -0.9634859  0.7739125 -1.1241828
## 2 -0.3688324 -0.3935457 -0.1369208  0.07398993 -0.1662087 -0.1700456  0.1673019
## 3  0.8942488 -0.4872402  1.0913679 -0.01330932  1.1109351 -0.4609873  0.7828949
##           dis        rad        tax     ptratio      black       lstat
## 1  1.05650031 -0.5965522 -0.6837494 -0.62478941  0.3607235 -0.90904433
## 2 -0.07766431 -0.5799077 -0.5409630 -0.04596655  0.2680397 -0.05818052
## 3 -0.84882600  1.3656860  1.3895093  0.63256391 -0.7083974  0.90799414
##          medv
## 1  0.84137443
## 2 -0.04811607
## 3 -0.69550394
ggpairs(Boston_kmeans[c("crim", "rad", "tax", "black", "age", "medv", "dis", "zn", "rm", "lstat")], aes(color=factor(k3m$cluster), fill=factor(k3m$cluster)), lower=list(combo=wrap("facethist",binwidth=0.5)))
## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero
fig. 7.1 k-means pairs

fig. 7.1 k-means pairs

Here we see that the blue cluster in this plot is roughly the same as the blue cluster from the k=2 clustering. The k=2 red cluster has here been split into red and green.

The differences in the red and green clusters seem to be: - the red cluster has higher values of zn more big plots - the red cluster has a smaller proportion of older buildings - the red cluster consists of exclusively black neighbourhoods, while the green cluster has some spread - the green cluster is between the red and blue clusters when it comes to proportion of laborers and uneducated adults

Maybe the red cluster matches better with those black neighbourhoods built more recently, which the 1949 Housing Act and the Federal Housing Authority regulations apply to? I don’t know for sure, more analysis would be required.


Assignment 5: Dimensionality reduction

Brief from moodle

Actually, a fairly large selection of statistical methods can be listed under the title “dimensionality reduction techniques”. Most often (nearly always, that is!) the real-world phenomena are multidimensional: they may consist of not just two or three but 5 or 10 or 20 or 50 (or more) dimensions. Of course, we are living only in a three-dimensional (3D) world, so those multiple dimensions may really challenge our imagination. It would be easier to reduce the number of dimensions in one way or another.

We shall now learn the basics of two data science based ways of reducing the dimensions. The principal method here is principal component analysis (PCA), which reduces any number of measured (continuous) and correlated variables into a few uncorrelated components that collect together as much variance as possible from the original variables. The most important components can be then used for various purposes, e.g., drawing scatterplots and other fancy graphs that would be quite impossible to achieve with the original variables and too many dimensions.

Multiple correspondence analysis (MCA) and other variations of CA bring us similar possibilities in the world of discrete variables, even nominal scale (classified) variables, by finding a suitable transformation into continuous scales and then reducing the dimensions quite analogously with the PCA. The typical graphs show the original classes of the discrete variables on the same “map”, making it possible to reveal connections (correspondences) between different things that would be quite impossible to see from the corresponding cross tables (too many numbers!).

Briefly stated, these methods help to visualize and understand multidimensional phenomena by reducing their dimensionality that may first feel impossible to handle at all.

0. Setup, read the data

library(readr)
library(tibble)
library(GGally)
library(corrplot)
library(dplyr)
library(FactoMineR)
human0 <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv", show_col_types = FALSE)

1. Country names as rownames + summaries

  • Move the country names to rownames (see Exercise 5.5).
  • Show a graphical overview of the data and show summaries of the variables in the data.
  • Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)

Rownames

human <- column_to_rownames(human0, "Country")

Summary, statistics

Let’s look at the data, starting with some descriptive statistics:

summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Based on the difference between mean and median, and the ranges of the variables, Edu.Exp and Parli.FM seem the most normally distributed. Edu2.FM, Labo.FM and Life.Exp are more skewed or have long tails (bdifference between mean and median, and the ranges of the variables). GNI and Mat.Mor are closer to log-normal (or a related distribution), and maybe Ado.Birth as well.

Summary, graphical

corrplot(cor(human), method="circle",type = "upper", tl.pos = "d")
fig. 1.1, correlation matrix of variables

fig. 1.1, correlation matrix of variables

There are definitely strong correlations in the data. We can identify four different sets of variables from this matrix.

First, there is one set of very strongly correlated variables: Mat.Mor, Ado.Birth, Life.Exp, and Edu.Exp. These will likely be very strongly represented in the first principal component of a PCA.

Second, these four variables also correlate very strongly with GNI and Edu2.FM, more strongly than that pair of variables correlate between themselves. Since they correlate less amongst themselves, there seems to be more degrees of freedom here.

Then there is a third group, Labo.FM and Parli.F. They seem much less correlated compared to the other set of variables, however, they are most strongly correlated with each other. Again, seemingly more degrees of freedom in this group.

ggpairs(human, progress = FALSE)
fig. 1.2, distributions and scatterplots of variables

fig. 1.2, distributions and scatterplots of variables

The correlations become much more clear when looking at the scatterplots. Some are clearly linear, like Edu.Exp and Life.Exp. It’s even clearer that GNI might need to be log-transformed, based on the distribution and the scatter plot shapes. Possibly Mat.Mor and Ado.Birth too.

The other distributions look like skewed normal-like distributions, all of them have one big main mode, although there are some interesting concentrations in certain parts of the distributions.

E.g., in Life.Exp, there is a clear plateau from 60 to roughly 65 or so. This may be related to quality of life before and after retirement.

Another example is this curious smaller mode in the Edu2.FM around 0.5, and a subsequent drop after that. This suggests that there are two overlaid populations, one where the mode is a little under 1, another where the mode is around 0.5. I don’t have a hypothesis for what could be causing this difference in behaviors in the two populations of countries, but it’s a very interesting feature of the distribution. Unfortunately we’ve already thrown away the two variables that this variable is based on, which may have given us some clues to what this effect is.

Let’s do some transforms, just for fun

human.log <- human %>% mutate(GNI = log(GNI)) %>% mutate(Mat.Mor = log(Mat.Mor)) %>% mutate(Ado.Birth = log(Ado.Birth))
ggpairs(human.log, progress = FALSE)
fig. 1.3, distributions and scatterplots of variables after log transform

fig. 1.3, distributions and scatterplots of variables after log transform

The scatter plots now seem to form much clearer trend lines, which seems to indicate this was a good idea. Let’s recheck the correlation matrix, just to check that the scale of the data didn’t mix things up too much.

corrplot(cor(human.log), method="circle",type = "upper", tl.pos = "d")
fig. 1.4, correlation matrix using log transformed dataset

fig. 1.4, correlation matrix using log transformed dataset

Well, this does seem to have changed a lot of things. The correlations are now much more even among five variables: Life.Exp, Edu.Exp, GNI, Mat.Mor, and Ado.Birth. The correlation with Edu2.FM is a little weaker, and the Parli.F and Labo.FM variables are again least correlated with the others.

2. PCA with raw data

  • Perform principal component analysis (PCA) on the raw (non-standardized) human data.
  • Show the variability captured by the principal components.
  • Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)

Since the log transform was not part of the assignment, let’s forget it for now, and do PCA on the raw dataset and look at the coefficients.

pca_raw <- prcomp(human)
pca_raw
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                     PC1           PC2           PC3           PC4           PC5
## Edu2.FM   -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04 -0.0022935252
## Labo.FM    2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03  0.0022190154
## Life.Exp  -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02  0.9865644425
## Edu.Exp   -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02  0.1431180282
## GNI       -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05 -0.0001135863
## Mat.Mor    5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03  0.0266373214
## Ado.Birth  1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03  0.0188618600
## Parli.F   -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01 -0.0716401914
##                     PC6           PC7           PC8
## Edu2.FM    2.180183e-02  6.998623e-01  7.139410e-01
## Labo.FM    3.264423e-02  7.132267e-01 -7.001533e-01
## Life.Exp  -1.453515e-01  5.380452e-03  2.281723e-03
## Edu.Exp    9.882477e-01 -3.826887e-02  7.776451e-03
## GNI       -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor    1.695203e-03  1.355518e-04  8.371934e-04
## Ado.Birth  1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F   -2.309896e-02 -2.642548e-03  2.680113e-03

PC1 is nearly all GNI, PC2 is nearly all Mat.Mor, and PC3 is nearly all Ado.Birth, etc. This is a problem. Let’s see why by looking at the summary of the transformed rows, which are in pca_raw$x.

summary(pca_raw$x)
##       PC1               PC2               PC3               PC4         
##  Min.   :-105495   Min.   :-858.85   Min.   :-87.306   Min.   :-37.798  
##  1st Qu.:  -6885   1st Qu.: -75.30   1st Qu.:-11.498   1st Qu.: -6.643  
##  Median :   5587   Median :  60.76   Median :  2.681   Median :  1.787  
##  Mean   :      0   Mean   :   0.00   Mean   :  0.000   Mean   :  0.000  
##  3rd Qu.:  13430   3rd Qu.: 117.75   3rd Qu.: 13.592   3rd Qu.:  8.200  
##  Max.   :  17051   Max.   : 200.85   Max.   : 99.552   Max.   : 26.267  
##       PC5                PC6                PC7                PC8          
##  Min.   :-16.0452   Min.   :-4.38647   Min.   :-0.70772   Min.   :-0.43774  
##  1st Qu.: -2.0796   1st Qu.:-1.10824   1st Qu.:-0.08793   1st Qu.:-0.08919  
##  Median :  0.5794   Median : 0.07198   Median : 0.02775   Median :-0.01657  
##  Mean   :  0.0000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.:  2.3562   3rd Qu.: 1.09716   3rd Qu.: 0.10801   3rd Qu.: 0.08364  
##  Max.   :  7.6214   Max.   : 3.80771   Max.   : 0.80521   Max.   : 0.50052

As we saw earlier, GNI per capita goes from roughly 500 to 100000, while the other variables were all max in the hundreds. Since PCA will maximize the spread along each principal component, we will get some bad decisions, because the distance scales in these variables are not comparable (the dataset needs to be standardized).

If we look at a summary of the pca, this is even more clear:

summary(pca_raw)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000

This tells us that PC1 captures nearly all of the variance (>0.9999) in the dataset, which again is due to the dataset not being standardized.

Knowing this, we can’t expect a very good biplot, but let’s plot one anyway.

pca_pr <- round(1*summary(pca_raw)$importance[2, ], digits = 4)
labels <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_raw, cex = c(0.8, 1), col = c("grey40", "#dd4028"), xlab = paste("Negative GNI,", labels[1]), ylab = paste("Maternal mortality,",labels[2]))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
fig. 2.1, biplot for PCA of the raw data

fig. 2.1, biplot for PCA of the raw data

Again, we see that PC1 is pretty much only GNI. This is evident from the fact that the GNI arrow is the only visible one. This can also be inferred from looking at the order of the countries, high GNI countries are to the left, low GNI countries to the right.

I’ve named the principal components according to the variable that they’ve picked up from the dataset (negative GNI and Maternal Mortality), since each of the first few components are almost aligned with one dimension.

3. PCA with standardized variables

  • Standardize the variables in the human data and repeat the above analysis.
  • Interpret the results of both analysis (with and without standardizing).
  • Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)

Let’s standardize, do a PCA, and look at the principal components.

pca_scaled <- human %>% scale %>% as.data.frame %>% prcomp
pca_scaled
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM   -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## Labo.FM    0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## Life.Exp  -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## Edu.Exp   -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## GNI       -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor    0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth  0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## Parli.F   -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                   PC6         PC7         PC8
## Edu2.FM    0.17713316  0.05773644  0.16459453
## Labo.FM   -0.03500707 -0.22729927 -0.07304568
## Life.Exp  -0.42242796 -0.43406432  0.62737008
## Edu.Exp   -0.38606919  0.77962966 -0.05415984
## GNI        0.11120305 -0.13711838 -0.16961173
## Mat.Mor    0.17370039  0.35380306  0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F    0.13749772  0.00568387 -0.02306476

This already looks much better. Let’s see the spreads.

summary(pca_scaled$x)
##       PC1               PC2                PC3                PC4          
##  Min.   :-3.4128   Min.   :-2.79379   Min.   :-2.13797   Min.   :-3.43435  
##  1st Qu.:-1.4651   1st Qu.:-0.61041   1st Qu.:-0.61625   1st Qu.:-0.51644  
##  Median :-0.4934   Median : 0.04166   Median :-0.07731   Median : 0.05376  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 1.0237   3rd Qu.: 0.68895   3rd Qu.: 0.47541   3rd Qu.: 0.54936  
##  Max.   : 6.0717   Max.   : 3.12552   Max.   : 2.56743   Max.   : 2.27574  
##       PC5                PC6                PC7                PC8         
##  Min.   :-2.74409   Min.   :-1.49002   Min.   :-0.98931   Min.   :-1.0576  
##  1st Qu.:-0.33133   1st Qu.:-0.34286   1st Qu.:-0.29823   1st Qu.:-0.1581  
##  Median : 0.06288   Median : 0.02236   Median :-0.02144   Median : 0.0261  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.41848   3rd Qu.: 0.26156   3rd Qu.: 0.29026   3rd Qu.: 0.1657  
##  Max.   : 1.54550   Max.   : 1.82764   Max.   : 1.13872   Max.   : 1.3844

Ok, the range of the dimensions seem much more sensible now, they are all in the same order of magnitude. Let’s see how much variance each principal component explains.

summary(pca_scaled)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000

PC1 explains more than half the variance, not bad! All principal components do seem to capture at least one percent of the variance however, so we weill be losing information if we decide to cut this off.

pca_pr <- round(1*summary(pca_scaled)$importance[2, ], digits = 4)
labels <- paste0(names(pca_pr), " (", pca_pr, "%)")

biplot(pca_scaled, cex = c(0.8, 1), col = c("grey40", "#dd4028"), xlab = paste("neg. Human development,", labels[1]), ylab = paste("Gender equality,",labels[2]))
fig. 3.1, biplot for PCA of the scaled data

fig. 3.1, biplot for PCA of the scaled data

Much better, and better yet, all original variables are roughly axis-aligned. I’ve added descriptive labels based on which variables align with which axes, more on these in section 4.

3.1 Interpretation

Raw data vs scaled data

As already discussed, the raw data was a bad fit for PCA due to the different orders of magnitude in the dispersion among the variables. PCA on raw data essentially just picked out one variable at a time in decreasing order of scale.

PCA on the scaled data performs much better.

PC1 is neg. Human development

I’ve decided to name PC1 neg. Human development, inspired by the name of the original dataset. This principal component measures the welfare of the country in terms of health (Mat.Mor, Ado.Birth, Life.Exp), standard of living (GNI per cap., ppp adjusted), and education (Edu.Exp, Edu2.FM). The value of this component is smaller with better outcomes in these domains, which is where the neg. comes in. I would rather take the negative of this PC1 and call it Human development, but this is what the PCA gave us.

PC2 is gender equality

PC2 I’ve called gender equality, because it measures female participation in political decision-making (0.65 * Parli.F) and female participation in the labour market (0.72 * Labo.FM).

This is perhaps not the whole story, because this component has a positive contribution from maternal mortality rates and adolescent birth. Maybe this component only measures whether society has moved away from traditional gender roles. In that case, this can be seen as a cultural liberal/conservative axis.

PC3 is negative female political empowerment, or negative attitudes towards female leadership

PC3 is positively correlated with female participation in political decision-making (0.73 * Parli.F), but negatively correlated with the ratio of female participation in labor markets and secondary education (-0.584 * Labo.FM, -0.24 * Edu2.FM).

Let’s flip this around and consider NPC3 = -PC3, it’s easier to reason about that way. If a country has a high level of female MPs relative to the female education and participation in the labor market, then NPC3 is high.

So this principal component measures how women are viewed as in society. Are they seen as leaders (and elected into parliament)? then PC3 is low. Are they seen as useful in the workforce but not as leaders? then PC3 is high.

PC4 is difficult to interpret

The principal components are getting harder and harder to reason about as we go further down the list.

Roughly, PC4 = 0.62 Edu2.FM - 0.72 GNI - 0.25 Mat.Mor.

These variables don’t seem to make a clear story. Edu2.FM should be very close to 1 for all developed nations, as high school dropouts are rare, and with negative GNI per cap. maybe this axis is about the economic development of the country? The maternal mortality rate is difficult to square with this.

This component might measure the relative focus on wealth as compared to other societal welfare in the country. With heavy emphasis on might.

human_scaled_pca <- data.frame(pca_scaled$x)

index <- order(human_scaled_pca$PC4)
human_scaled_pca %>% arrange(desc(PC4))
##                                                   PC1          PC2          PC3
## Myanmar                                    0.22129643 -0.339858349 -2.137967976
## Gabon                                      0.38951619  0.539298671 -1.477157917
## Guyana                                     0.81614017 -0.005126477  0.855888106
## Libya                                     -1.53980422 -1.493748457  0.207781921
## Tajikistan                                 0.21994425 -0.299649920 -0.471229436
## Lesotho                                    2.41352165  0.966326284 -0.469152353
## Honduras                                   0.37144496 -0.477332449  0.893880234
## Nicaragua                                  0.33842550  0.570944859  1.600126090
## Jamaica                                   -0.06175153  0.015692724 -0.477625887
## Samoa                                     -0.49340583 -2.067158649 -0.048787759
## Moldova (Republic of)                     -0.14463141  0.316328188 -0.310093407
## Kyrgyzstan                                -0.09841728 -0.010254730  0.233419334
## Philippines                                0.25460237  0.009222189  0.528877460
## Belize                                     0.03821611 -0.834020935 -0.164341793
## Armenia                                   -0.41497907 -0.613047804 -0.668870242
## Bhutan                                     0.21621916 -0.205695481 -1.274792504
## Rwanda                                     1.08474351  3.125519996  1.399402307
## Dominican Republic                         0.15665142 -0.232069047 -0.005138172
## Namibia                                    0.58896079  1.429030129  0.561863707
## Georgia                                   -0.38211425 -0.445864707 -0.649432206
## Tonga                                     -0.45134514 -1.165949078 -1.371247180
## Ecuador                                   -0.48514379  1.062937942  1.432626667
## Mongolia                                  -0.58002389  0.031790917 -0.819700835
## Venezuela (Bolivarian Republic of)        -0.35614736 -0.328185500 -0.262447337
## Fiji                                      -0.54064402 -1.018100867  0.043783998
## Colombia                                  -0.24456731 -0.012229219 -0.017543050
## Ukraine                                   -0.61256414 -0.236511138 -0.828004404
## Guatemala                                  0.92978002 -1.021873670  0.076300346
## Barbados                                  -0.83161287  0.528012907 -0.591747431
## Cuba                                      -0.85690790  1.206502306  2.170553442
## Brazil                                    -0.52713454 -0.437211056 -0.914408365
## South Africa                               0.52541319  1.290788972  0.929186725
## Costa Rica                                -0.88088858  0.247783224  1.177525013
## Panama                                    -0.52104983 -0.445966480  0.109079216
## Viet Nam                                  -0.12772727  0.635409753 -0.023983110
## Albania                                   -0.67704042 -0.334378287  0.233899467
## Uruguay                                   -1.10762421 -0.407424570 -0.791498107
## Bangladesh                                 1.17163475 -0.264831864  0.284988061
## Azerbaijan                                -0.13345597  0.270223921 -0.917559403
## Sri Lanka                                 -0.75648305 -1.887423148 -0.154346306
## Bulgaria                                  -0.83448409  0.297191795 -0.354609758
## El Salvador                                0.31502973 -0.082434305  0.932630500
## Bahamas                                   -0.80193053  0.256043765 -0.862499787
## Paraguay                                   0.44428803 -0.500263835  0.076871476
## Swaziland                                  2.33315345 -0.596949105 -0.306431397
## Bolivia (Plurinational State of)           0.61863947  2.118079004  1.809884477
## Congo                                      2.50890521  0.529214816 -1.324187811
## Belarus                                   -1.09207196  0.821722632  0.249889580
## Kenya                                      2.23833543  0.681110048 -0.446481460
## Botswana                                   0.51070846 -0.005038506 -1.402977382
## Thailand                                  -0.32856107 -0.610848412 -1.075159434
## Peru                                      -0.08991745  0.379961425 -0.057040060
## Zimbabwe                                   2.24835708  1.728001429  0.178907487
## Mexico                                    -0.53292222  0.345557424  1.529477321
## Kazakhstan                                -0.75696807  0.549022028 -0.685887511
## Uganda                                     2.95628470  1.805866347  0.326356280
## Suriname                                  -0.12432505 -0.997314985 -0.287838649
## Latvia                                    -1.31130851  0.187341613 -0.642976519
## Montenegro                                -1.11686706 -0.122488486 -0.280938313
## Romania                                   -0.77188498 -0.403249223 -0.702527677
## Jordan                                    -0.57591739 -2.362089025  0.910026604
## Syrian Arab Republic                       0.36377295 -2.559738024  1.335263924
## Russian Federation                        -0.80811646 -0.054778581 -0.829752646
## Lebanon                                   -1.23268082 -2.526681094  0.043940503
## China                                     -0.67672686  0.359953468  0.049521929
## Maldives                                  -0.67904281 -0.989578003 -0.802207742
## Hungary                                   -1.36809208 -0.493681621 -0.942204385
## Chile                                     -1.22310783 -0.447088566 -0.163992357
## Serbia                                    -0.83377077  0.705073607  0.951750046
## Iran (Islamic Republic of)                -0.96436342 -2.793788341  0.315434603
## Indonesia                                  0.48036817 -0.567596783  0.116329285
## Croatia                                   -1.30589585  0.406709990  0.159833574
## Estonia                                   -1.68580633  0.371211438 -0.569235608
## Slovakia                                  -1.48297697 -0.019596892 -0.401935798
## Argentina                                 -1.45601757  0.838984586  1.037680344
## Trinidad and Tobago                       -0.43657185  0.131125955  0.074004863
## Portugal                                  -1.99792574  1.040595945  0.180046605
## Ghana                                      2.01345921  0.355485039 -1.241712687
## Lithuania                                 -1.47424792  0.607825365 -0.365058067
## Algeria                                   -0.88489346 -1.630270510  1.842073337
## Czech Republic                            -1.94024133  0.041663314 -0.421174852
## Poland                                    -1.51988207  0.193167920 -0.111253539
## Malaysia                                  -0.92235926 -0.984616060 -0.085310083
## The former Yugoslav Republic of Macedonia -0.62281995  0.263132896  1.292063415
## Haiti                                      2.36064900 -0.561384801 -1.290302990
## Sudan                                      2.60420748 -1.045381397  1.334927961
## Mauritius                                 -0.79068829 -0.937216376 -0.221835304
## Tanzania (United Republic of)              2.88648521  1.899305596  0.487200420
## Slovenia                                  -2.16450489  0.836704528 -0.077310885
## Egypt                                      0.07863364 -2.585345419  0.217108451
## Morocco                                    0.36411731 -2.047307836  0.794721315
## Zambia                                     2.37579379  0.275956183 -0.731488256
## Papua New Guinea                           2.24414173 -0.288614274 -1.445469022
## Tunisia                                   -0.84854176 -0.830766824  1.960056511
## Gambia                                     3.41276063 -0.007016863 -0.901829512
## Cyprus                                    -1.52126616 -0.301457929 -0.810134452
## Bosnia and Herzegovina                    -0.49234291 -0.706061064  0.646486710
## Israel                                    -2.08527706  0.550448154 -0.402495966
## Bahrain                                   -1.83627456 -1.277276062 -0.005796514
## Greece                                    -1.99151909  0.052754647 -0.028068250
## New Zealand                               -2.49492486  1.269102839  0.014971514
## Cameroon                                   3.50119915  1.065163090  0.063364397
## Malawi                                     3.49174202  1.207150760 -0.936231767
## Nepal                                      1.43508146  1.194786495  0.463613461
## Burundi                                    2.92716442  2.104937966 -0.100589338
## Iceland                                   -2.74910896  2.053637043  0.419653402
## Senegal                                    2.60628829  1.250222087  1.794207914
## Malta                                     -1.44928760 -1.028809545 -0.081348924
## Japan                                     -2.20169826 -0.609972749 -0.751252991
## Finland                                   -2.48617797  1.892468888  0.594038932
## Cambodia                                   1.49861802  0.434332761 -0.135571575
## Mali                                       4.42864302 -0.725923093 -0.189211918
## Spain                                     -2.33209334  1.348636610  0.700913469
## United Kingdom                            -2.05713992  0.589584266 -0.407850345
## Ireland                                   -2.62808538  0.362020337 -0.640755797
## Benin                                      2.85453769 -0.106611646 -0.837692019
## Ethiopia                                   2.92275146  0.800047424  0.272497038
## Canada                                    -2.26302904  1.033411190 -0.292957486
## Italy                                     -2.18260449  0.338893545  0.646989575
## Iraq                                       0.96258010 -1.699941581  2.252525442
## France                                    -2.19885491  0.696790265 -0.217303004
## Turkey                                    -0.57530772 -1.527893514  0.722212692
## Austria                                   -2.38403803  0.902207024 -0.003026204
## India                                      1.10413681 -1.992805576  1.003863214
## Pakistan                                   1.72263666 -1.938459968  1.749897520
## Togo                                       2.92936876  1.008638618 -0.699996373
## Germany                                   -2.51325622  1.320373244  0.388898222
## Australia                                 -2.99370191  1.188201602 -0.106765315
## Sweden                                    -2.51286376  1.963931278  0.595735063
## Belgium                                   -2.35585306  1.588259556  0.847919087
## Denmark                                   -2.78120158  1.773429808  0.171302280
## Korea (Republic of)                       -2.10167572 -0.277374746 -0.352782824
## Mauritania                                 2.60820181 -1.330052714  1.623557480
## United Arab Emirates                      -2.14191598 -0.901306523 -0.292906252
## Oman                                      -1.29017323 -2.054918869  0.293468342
## Netherlands                               -2.72716147  1.469105166  0.308130817
## Côte d'Ivoire                              4.64084255 -0.598211941 -0.459138696
## Burkina Faso                               3.94333744  0.053955410 -0.319270357
## Liberia                                    3.96747850  0.317838331 -0.863492996
## Congo (Democratic Republic of the)         4.43880394  0.539558045 -1.303548841
## Mozambique                                 4.24412061  2.357683071  0.761360284
## Yemen                                      2.36569153 -2.652557169  0.382761370
## Niger                                      5.39088099 -1.241241283  0.873413438
## United States                             -2.15666301  0.461289598 -0.876464679
## Central African Republic                   5.15979806  0.299745466 -0.827624027
## Sierra Leone                               5.29969299  0.894599149 -1.383380245
## Switzerland                               -2.69683318  0.896899989 -0.275048181
## Luxembourg                                -2.34430026  0.673760445 -0.178640850
## Afghanistan                                3.17249375 -1.530064615  2.567425390
## Norway                                    -3.08378347  1.890281525  0.056969733
## Saudi Arabia                              -2.04814861 -1.613019142  0.881908713
## Chad                                       6.07170782  0.419662712 -0.379782202
## Singapore                                 -2.86994888  0.516674990 -0.429116560
## Kuwait                                    -2.34123467 -1.651753979 -1.483382891
## Qatar                                     -3.41275720 -1.651554785 -2.092935891
##                                                    PC4          PC5
## Myanmar                                    2.275737391 -0.784101836
## Gabon                                      1.778909476 -2.181779590
## Guyana                                     1.490657114 -1.914978588
## Libya                                      1.311190236 -0.777481767
## Tajikistan                                 1.280122465  0.062883028
## Lesotho                                    1.226549015 -1.630682971
## Honduras                                   1.206509214 -1.014293478
## Nicaragua                                  1.100369752 -1.082146297
## Jamaica                                    1.044404927 -0.227410096
## Samoa                                      1.016812799 -0.110042154
## Moldova (Republic of)                      0.994627115  0.470592772
## Kyrgyzstan                                 0.984423355  0.253963473
## Philippines                                0.943707745 -0.508529110
## Belize                                     0.905689903 -0.354976024
## Armenia                                    0.845581564  0.440952425
## Bhutan                                     0.843768126  0.344669908
## Rwanda                                     0.823466529 -0.094488185
## Dominican Republic                         0.820699804 -0.898427787
## Namibia                                    0.814718726 -0.404258828
## Georgia                                    0.800279490  0.377801426
## Tonga                                      0.781942289  0.749978939
## Ecuador                                    0.768903920 -0.699510243
## Mongolia                                   0.758449744  0.440072188
## Venezuela (Bolivarian Republic of)         0.758276118 -0.904889663
## Fiji                                       0.751446198  0.004254590
## Colombia                                   0.739148454 -0.404110156
## Ukraine                                    0.735815695  0.683855615
## Guatemala                                  0.732023267 -0.737883028
## Barbados                                   0.716841534  0.216873945
## Cuba                                       0.652802978 -0.141187770
## Brazil                                     0.635546781 -0.242856529
## South Africa                               0.625719759 -0.626998851
## Costa Rica                                 0.612051066 -0.448031363
## Panama                                     0.606356209 -0.882843754
## Viet Nam                                   0.603302121  0.872257041
## Albania                                    0.600494707  0.497187324
## Uruguay                                    0.599660130 -0.231245152
## Bangladesh                                 0.581364674 -0.152520365
## Azerbaijan                                 0.560060281  0.186188142
## Sri Lanka                                  0.538663748  0.378862799
## Bulgaria                                   0.529608478  0.276207632
## El Salvador                                0.498582285 -0.209535347
## Bahamas                                    0.488197758  0.057513980
## Paraguay                                   0.471807391  0.004514533
## Swaziland                                  0.453557485 -0.664998668
## Bolivia (Plurinational State of)           0.399174576 -0.250247599
## Congo                                      0.398128908 -0.669796656
## Belarus                                    0.394723997  0.361435870
## Kenya                                      0.390264593 -0.349458967
## Botswana                                   0.355643457  0.015774383
## Thailand                                   0.347688665  0.621808637
## Peru                                       0.346459243  0.357781868
## Zimbabwe                                   0.329045089 -0.135099353
## Mexico                                     0.313139790 -0.529001350
## Kazakhstan                                 0.308198625  0.224421990
## Uganda                                     0.305231215 -0.547035571
## Suriname                                   0.298543025 -0.101737220
## Latvia                                     0.278702422  0.357077524
## Montenegro                                 0.271709792  0.787994084
## Romania                                    0.270762447  0.358282124
## Jordan                                     0.239258300 -0.030840898
## Syrian Arab Republic                       0.236685678  0.186780574
## Russian Federation                         0.230843800  0.198560713
## Lebanon                                    0.219929309  0.207450407
## China                                      0.207098455  0.948090559
## Maldives                                   0.204993662  1.087492380
## Hungary                                    0.190542428  0.422324617
## Chile                                      0.151604505 -0.029232073
## Serbia                                     0.150877958  0.741356169
## Iran (Islamic Republic of)                 0.145504912 -0.038590210
## Indonesia                                  0.140336726  0.123615819
## Croatia                                    0.137752539  0.539931997
## Estonia                                    0.125625669  0.348079365
## Slovakia                                   0.108807736  0.184357318
## Argentina                                  0.100541901 -0.382426961
## Trinidad and Tobago                        0.094363327 -0.478807410
## Portugal                                   0.070353716  0.343488087
## Ghana                                      0.053762696  0.570568434
## Lithuania                                  0.038625393  0.469887414
## Algeria                                    0.036027293 -0.035468371
## Czech Republic                             0.034533898  0.396983608
## Poland                                     0.028293784  0.478839786
## Malaysia                                  -0.007775677  0.250309695
## The former Yugoslav Republic of Macedonia -0.011789161  0.757393781
## Haiti                                     -0.018017361  0.819230616
## Sudan                                     -0.030656873 -0.712459363
## Mauritius                                 -0.053895731  0.417082038
## Tanzania (United Republic of)             -0.057669838 -0.228794202
## Slovenia                                  -0.070286122  0.519533942
## Egypt                                     -0.084109671  0.384246559
## Morocco                                   -0.091930219  0.468826463
## Zambia                                    -0.098419324  0.125849595
## Papua New Guinea                          -0.125337620  1.185766443
## Tunisia                                   -0.167303534  0.654406033
## Gambia                                    -0.172392139  0.044491812
## Cyprus                                    -0.181589001  0.539000803
## Bosnia and Herzegovina                    -0.182925626  1.209146178
## Israel                                    -0.191614837  0.468282001
## Bahrain                                   -0.214807455 -0.784249226
## Greece                                    -0.224756734  0.689741659
## New Zealand                               -0.241859264  0.196214374
## Cameroon                                  -0.260759929 -0.555844104
## Malawi                                    -0.308982421  0.054072818
## Nepal                                     -0.316593826  1.114794397
## Burundi                                   -0.323059706  0.419879197
## Iceland                                   -0.327005382  0.264383077
## Senegal                                   -0.333524119  0.042331771
## Malta                                     -0.354324825  0.284860129
## Japan                                     -0.370785436  0.142433142
## Finland                                   -0.402199400 -0.035148222
## Cambodia                                  -0.403276572  1.545495585
## Mali                                      -0.407128283 -1.027734140
## Spain                                     -0.408851517  0.350774791
## United Kingdom                            -0.430565377 -0.121151666
## Ireland                                   -0.474385756  0.158551905
## Benin                                     -0.510128117  0.862388776
## Ethiopia                                  -0.522758647  0.638178331
## Canada                                    -0.536756869 -0.054260003
## Italy                                     -0.553844809  0.335992609
## Iraq                                      -0.565370695 -0.287911201
## France                                    -0.570782466  0.293158744
## Turkey                                    -0.590951238  0.618018515
## Austria                                   -0.615634290 -0.107843670
## India                                     -0.628896340  0.852767619
## Pakistan                                  -0.633197073  0.676731900
## Togo                                      -0.644239048  0.928867228
## Germany                                   -0.648265885 -0.140054500
## Australia                                 -0.667575221  0.146603167
## Sweden                                    -0.672261525 -0.206796715
## Belgium                                   -0.676265477 -0.050602376
## Denmark                                   -0.683471922  0.056081866
## Korea (Republic of)                       -0.691151610  0.615032289
## Mauritania                                -0.732229396  0.149028861
## United Arab Emirates                      -0.773527981 -1.767659249
## Oman                                      -0.795461293 -0.057279210
## Netherlands                               -0.797993542 -0.024473662
## Côte d'Ivoire                             -0.803721453 -0.656805429
## Burkina Faso                              -0.808281584  0.585365213
## Liberia                                   -0.821010099  0.337292077
## Congo (Democratic Republic of the)        -0.866980658  0.149837925
## Mozambique                                -0.926674225  0.313358992
## Yemen                                     -0.949498266  1.002500642
## Niger                                     -0.965501357 -1.303263726
## United States                             -0.972954474 -0.523731016
## Central African Republic                  -1.039430441 -0.027250931
## Sierra Leone                              -1.136109156 -0.260135837
## Switzerland                               -1.166378061 -0.293310010
## Luxembourg                                -1.167750947 -0.659139954
## Afghanistan                               -1.307267500  0.171076164
## Norway                                    -1.429085734 -0.634591949
## Saudi Arabia                              -1.483999840 -0.771322474
## Chad                                      -1.710289949 -0.313209384
## Singapore                                 -2.145302064 -0.773851494
## Kuwait                                    -2.268698795 -1.459482728
## Qatar                                     -3.434350558 -2.744086530
##                                                     PC6          PC7
## Myanmar                                    1.7676007033 -0.806312423
## Gabon                                     -0.1202364303  0.236755731
## Guyana                                     0.3639403210 -0.027849786
## Libya                                      0.9330807855  0.570157750
## Tajikistan                                 0.3475875471 -0.503606390
## Lesotho                                    1.0758622144  1.138717611
## Honduras                                  -0.3160976543 -0.377880488
## Nicaragua                                 -0.6811130040 -0.518838884
## Jamaica                                   -0.5596006430 -0.552066779
## Samoa                                      0.1862354772  0.197408492
## Moldova (Republic of)                      0.3856501921 -0.575902862
## Kyrgyzstan                                 0.4470455299 -0.082545945
## Philippines                                0.5735628979 -0.188977457
## Belize                                    -0.5257968357  0.220621752
## Armenia                                    0.1512678759 -0.513028090
## Bhutan                                     0.1354618087 -0.160322615
## Rwanda                                     1.4882291993 -0.291101527
## Dominican Republic                        -0.9977964148 -0.136457443
## Namibia                                    0.6531061286 -0.279270660
## Georgia                                   -0.4329675048 -0.129134324
## Tonga                                      0.0223647898  0.463117389
## Ecuador                                   -0.6229513015  0.060219694
## Mongolia                                   0.3664312347  0.382144832
## Venezuela (Bolivarian Republic of)        -0.8202034504  0.165151944
## Fiji                                      -0.2586706159  0.954688210
## Colombia                                  -0.5193595878 -0.089128360
## Ukraine                                   -0.0402324586  0.377186088
## Guatemala                                 -0.6663882061 -0.519546181
## Barbados                                  -0.5554976971  0.143179329
## Cuba                                      -0.1046529231 -0.129211242
## Brazil                                    -0.9387997247  0.259841102
## South Africa                               0.8641957710  0.880893184
## Costa Rica                                -0.5801338772 -0.205490683
## Panama                                    -0.7794109136 -0.261632639
## Viet Nam                                   0.1370254776 -0.819669561
## Albania                                    0.3719985522 -0.758087155
## Uruguay                                   -0.8520051813  0.123425781
## Bangladesh                                -0.2801004193 -0.761911579
## Azerbaijan                                 0.2228914450 -0.691575682
## Sri Lanka                                  0.1134797992  0.181878318
## Bulgaria                                  -0.1548721090 -0.094755523
## El Salvador                               -0.5208362283 -0.298642007
## Bahamas                                    0.2221208500 -0.684132379
## Paraguay                                  -0.3868646873 -0.383448677
## Swaziland                                  0.9379950756  1.083913498
## Bolivia (Plurinational State of)          -0.0005896796  0.218008164
## Congo                                     -0.7300608217  0.035047723
## Belarus                                    0.2002805861  0.442680455
## Kenya                                      0.0003908410  0.192825515
## Botswana                                   0.4208028626  0.051944196
## Thailand                                  -0.3728835623 -0.321836338
## Peru                                      -0.2954092927 -0.346454974
## Zimbabwe                                   1.0326915615  0.484252449
## Mexico                                    -0.3750194353 -0.284486795
## Kazakhstan                                 0.1478071951  0.258894510
## Uganda                                    -0.2663511002 -0.222926610
## Suriname                                   0.2671143004  0.054969268
## Latvia                                     0.1885997063  0.131359913
## Montenegro                                -0.0735326488  0.115347703
## Romania                                   -0.1404008807 -0.081718774
## Jordan                                     0.0960728327  0.426350376
## Syrian Arab Republic                       0.0533687969  0.381438615
## Russian Federation                         0.1868210793  0.214125170
## Lebanon                                   -0.0082757639  0.066977961
## China                                      0.3732800063 -0.461876446
## Maldives                                   0.2331259089 -0.434462712
## Hungary                                    0.0530148587  0.202738900
## Chile                                     -0.9916455370 -0.141135886
## Serbia                                     0.1960956378  0.001140764
## Iran (Islamic Republic of)                -0.3591669622  0.722852945
## Indonesia                                  0.0697247267  0.318509670
## Croatia                                    0.1183552755 -0.081785909
## Estonia                                   -0.1433648078  0.319266420
## Slovakia                                   0.0857885836  0.030612783
## Argentina                                 -0.7562700661  0.998335318
## Trinidad and Tobago                        0.5483599732 -0.287201887
## Portugal                                  -0.1179107107  0.036420328
## Ghana                                      0.3721749488  0.232143202
## Lithuania                                  0.1583897022  0.463060174
## Algeria                                    0.4902133796  0.630600521
## Czech Republic                             0.0054545873  0.274517961
## Poland                                     0.0146512625  0.078155353
## Malaysia                                   0.5943478473 -0.310882005
## The former Yugoslav Republic of Macedonia  0.2271429608 -0.225518600
## Haiti                                      0.8528400232 -0.487249059
## Sudan                                      0.6102373647 -0.583319974
## Mauritius                                 -0.3224620116  0.555744456
## Tanzania (United Republic of)             -0.4568728938 -0.686533066
## Slovenia                                   0.0215875237  0.200046218
## Egypt                                     -0.3163885313  0.410135541
## Morocco                                    0.0385729540 -0.140453926
## Zambia                                    -1.2011155278  0.645660843
## Papua New Guinea                           0.0929917875 -0.590106926
## Tunisia                                    0.3893479627  0.555015084
## Gambia                                    -0.3474528468 -0.395655997
## Cyprus                                     0.1192318589 -0.540291494
## Bosnia and Herzegovina                    -0.0270242943 -0.181398112
## Israel                                    -0.1633957692 -0.182073253
## Bahrain                                    0.3793530928  0.116840333
## Greece                                    -0.4677522487  0.508801969
## New Zealand                               -0.7465178815  0.702424299
## Cameroon                                   0.0737457183  0.614653903
## Malawi                                    -1.1968378168 -0.088176314
## Nepal                                     -0.5593956686 -0.297819695
## Burundi                                    1.8276366746  0.677851002
## Iceland                                   -0.3893837420  0.527074951
## Senegal                                    0.1229101109 -0.989313955
## Malta                                     -0.1902293297 -0.234558008
## Japan                                     -0.1092196560 -0.295650382
## Finland                                    0.0500578223  0.125708116
## Cambodia                                   0.0893450889 -0.641196445
## Mali                                      -1.1754044432 -0.019086504
## Spain                                     -0.2384440630  0.193672897
## United Kingdom                            -0.3405838816 -0.083078182
## Ireland                                   -0.3678977300  0.633980162
## Benin                                     -0.3370880163  0.133803308
## Ethiopia                                   0.2799208080 -0.661627350
## Canada                                    -0.0899079143 -0.292960474
## Italy                                     -0.0518444082 -0.045035198
## Iraq                                      -0.0205434516 -0.377293351
## France                                    -0.0424944013 -0.194411707
## Turkey                                    -0.3388083756  0.259713548
## Austria                                    0.2005291571 -0.252609069
## India                                      0.2976255773  0.295702180
## Pakistan                                   1.0547918604 -0.663717157
## Togo                                      -0.3551166163  0.469693499
## Germany                                    0.1990240456 -0.001114540
## Australia                                 -0.6199683954  0.917438777
## Sweden                                     0.2506794343 -0.372812563
## Belgium                                    0.1848918584 -0.041147418
## Denmark                                   -0.0930002006  0.544652022
## Korea (Republic of)                       -0.2412194847  0.254405229
## Mauritania                                 0.3835135567 -0.206485725
## United Arab Emirates                       0.4626591783 -0.456221585
## Oman                                       0.2559958563 -0.052902782
## Netherlands                               -0.0840485953  0.299683270
## Côte d'Ivoire                              0.0332611178  0.771694885
## Burkina Faso                              -0.3017379532 -0.688616909
## Liberia                                   -0.4465735263  0.044506003
## Congo (Democratic Republic of the)        -0.6733926462  0.284824245
## Mozambique                                -0.4315443547 -0.191228703
## Yemen                                      0.3877720692 -0.105861134
## Niger                                     -1.4900241152 -0.772778138
## United States                             -0.3451427925 -0.002143577
## Central African Republic                   0.9515520853  0.428317282
## Sierra Leone                               0.9362103161  1.063217796
## Switzerland                                0.1865273520 -0.420485693
## Luxembourg                                 0.4269134315 -0.843634606
## Afghanistan                                0.1650292048  0.419870531
## Norway                                     0.1057068050 -0.021435945
## Saudi Arabia                               0.2995446875  0.814349653
## Chad                                      -0.1384298783  0.505501713
## Singapore                                  0.2087344376 -0.631938288
## Kuwait                                     0.4433858104 -0.154579254
## Qatar                                      0.7794028704 -0.869044046
##                                                    PC8
## Myanmar                                    0.464249830
## Gabon                                     -0.016809690
## Guyana                                     0.283691259
## Libya                                      0.150608995
## Tajikistan                                -0.216743506
## Lesotho                                   -0.302736425
## Honduras                                   0.264847993
## Nicaragua                                  0.168834839
## Jamaica                                    0.212157687
## Samoa                                      0.295420249
## Moldova (Republic of)                     -0.213918903
## Kyrgyzstan                                -0.042950292
## Philippines                               -0.097634576
## Belize                                    -0.319139671
## Armenia                                    0.092802384
## Bhutan                                    -0.074744150
## Rwanda                                     0.123998389
## Dominican Republic                        -0.001090328
## Namibia                                   -0.508079106
## Georgia                                    0.042053001
## Tonga                                      0.306781672
## Ecuador                                    0.136462760
## Mongolia                                  -0.229052691
## Venezuela (Bolivarian Republic of)         0.135464359
## Fiji                                      -0.192307926
## Colombia                                   0.040791381
## Ukraine                                   -0.298001037
## Guatemala                                  0.082547026
## Barbados                                   0.027150536
## Cuba                                       0.481755252
## Brazil                                    -0.015904503
## South Africa                              -1.053520373
## Costa Rica                                 0.303047765
## Panama                                     0.311755507
## Viet Nam                                   0.087508486
## Albania                                    0.295251360
## Uruguay                                    0.027424769
## Bangladesh                                 0.135731141
## Azerbaijan                                -0.410686138
## Sri Lanka                                  0.190906050
## Bulgaria                                  -0.221248901
## El Salvador                               -0.142869401
## Bahamas                                    0.017655734
## Paraguay                                   0.035958347
## Swaziland                                 -1.057558817
## Bolivia (Plurinational State of)          -0.183113873
## Congo                                     -0.017934893
## Belarus                                   -0.466662216
## Kenya                                      0.026234807
## Botswana                                  -0.413437307
## Thailand                                  -0.146109071
## Peru                                       0.026102641
## Zimbabwe                                   0.016998870
## Mexico                                     0.070725796
## Kazakhstan                                -0.575145605
## Uganda                                    -0.572916878
## Suriname                                   0.088444159
## Latvia                                    -0.169022123
## Montenegro                                -0.024931977
## Romania                                   -0.075240568
## Jordan                                     0.176778023
## Syrian Arab Republic                      -0.199383676
## Russian Federation                        -0.481743543
## Lebanon                                    0.488700890
## China                                      0.025315956
## Maldives                                   0.197975099
## Hungary                                   -0.059973220
## Chile                                      0.323884955
## Serbia                                    -0.152580467
## Iran (Islamic Republic of)                 0.148533006
## Indonesia                                  0.017986999
## Croatia                                    0.040691396
## Estonia                                   -0.047421539
## Slovakia                                  -0.049162831
## Argentina                                  0.001276173
## Trinidad and Tobago                       -0.252645792
## Portugal                                   0.229393237
## Ghana                                     -0.037526847
## Lithuania                                 -0.330632189
## Algeria                                    0.366520434
## Czech Republic                             0.125812767
## Poland                                    -0.006129686
## Malaysia                                   0.025054670
## The former Yugoslav Republic of Macedonia -0.140384418
## Haiti                                      0.202062572
## Sudan                                      0.195483201
## Mauritius                                  0.021314559
## Tanzania (United Republic of)              0.031354925
## Slovenia                                   0.201136871
## Egypt                                     -0.229456272
## Morocco                                    0.283122879
## Zambia                                    -0.794475565
## Papua New Guinea                          -0.568359399
## Tunisia                                    0.085563570
## Gambia                                    -0.154919425
## Cyprus                                     0.238419359
## Bosnia and Herzegovina                    -0.037480926
## Israel                                     0.298293132
## Bahrain                                    0.116941358
## Greece                                     0.204463481
## New Zealand                                0.134924597
## Cameroon                                   0.012924475
## Malawi                                     0.078962173
## Nepal                                     -0.312497632
## Burundi                                    0.872691209
## Iceland                                    0.162465364
## Senegal                                   -0.050182834
## Malta                                      0.261913162
## Japan                                      0.466519718
## Finland                                    0.056001135
## Cambodia                                  -0.343470935
## Mali                                      -0.051504428
## Spain                                      0.220613119
## United Kingdom                             0.075500247
## Ireland                                    0.142599173
## Benin                                     -0.550244820
## Ethiopia                                   0.123086514
## Canada                                     0.172678430
## Italy                                      0.341553582
## Iraq                                      -0.466207100
## France                                     0.236234797
## Turkey                                    -0.161342462
## Austria                                    0.146278110
## India                                     -0.049179210
## Pakistan                                  -0.193259634
## Togo                                      -0.266806482
## Germany                                    0.085755248
## Australia                                  0.122261331
## Sweden                                     0.117156987
## Belgium                                    0.045294793
## Denmark                                   -0.054496911
## Korea (Republic of)                        0.313970540
## Mauritania                                -0.121129103
## United Arab Emirates                      -0.094267148
## Oman                                       0.014389692
## Netherlands                                0.062106304
## Côte d'Ivoire                              0.135308345
## Burkina Faso                              -0.537702988
## Liberia                                    0.479166548
## Congo (Democratic Republic of the)         0.535636325
## Mozambique                                -0.793719582
## Yemen                                     -0.164016287
## Niger                                      0.358842606
## United States                             -0.117040628
## Central African Republic                   0.645440709
## Sierra Leone                               1.384425435
## Switzerland                                0.149931235
## Luxembourg                                 0.088632850
## Afghanistan                               -0.184925327
## Norway                                    -0.123820241
## Saudi Arabia                              -0.334517472
## Chad                                       0.720298478
## Singapore                                 -0.058473974
## Kuwait                                    -0.577682871
## Qatar                                     -0.544730977
human_scaled_pca %>% arrange(PC4)
##                                                   PC1          PC2          PC3
## Qatar                                     -3.41275720 -1.651554785 -2.092935891
## Kuwait                                    -2.34123467 -1.651753979 -1.483382891
## Singapore                                 -2.86994888  0.516674990 -0.429116560
## Chad                                       6.07170782  0.419662712 -0.379782202
## Saudi Arabia                              -2.04814861 -1.613019142  0.881908713
## Norway                                    -3.08378347  1.890281525  0.056969733
## Afghanistan                                3.17249375 -1.530064615  2.567425390
## Luxembourg                                -2.34430026  0.673760445 -0.178640850
## Switzerland                               -2.69683318  0.896899989 -0.275048181
## Sierra Leone                               5.29969299  0.894599149 -1.383380245
## Central African Republic                   5.15979806  0.299745466 -0.827624027
## United States                             -2.15666301  0.461289598 -0.876464679
## Niger                                      5.39088099 -1.241241283  0.873413438
## Yemen                                      2.36569153 -2.652557169  0.382761370
## Mozambique                                 4.24412061  2.357683071  0.761360284
## Congo (Democratic Republic of the)         4.43880394  0.539558045 -1.303548841
## Liberia                                    3.96747850  0.317838331 -0.863492996
## Burkina Faso                               3.94333744  0.053955410 -0.319270357
## Côte d'Ivoire                              4.64084255 -0.598211941 -0.459138696
## Netherlands                               -2.72716147  1.469105166  0.308130817
## Oman                                      -1.29017323 -2.054918869  0.293468342
## United Arab Emirates                      -2.14191598 -0.901306523 -0.292906252
## Mauritania                                 2.60820181 -1.330052714  1.623557480
## Korea (Republic of)                       -2.10167572 -0.277374746 -0.352782824
## Denmark                                   -2.78120158  1.773429808  0.171302280
## Belgium                                   -2.35585306  1.588259556  0.847919087
## Sweden                                    -2.51286376  1.963931278  0.595735063
## Australia                                 -2.99370191  1.188201602 -0.106765315
## Germany                                   -2.51325622  1.320373244  0.388898222
## Togo                                       2.92936876  1.008638618 -0.699996373
## Pakistan                                   1.72263666 -1.938459968  1.749897520
## India                                      1.10413681 -1.992805576  1.003863214
## Austria                                   -2.38403803  0.902207024 -0.003026204
## Turkey                                    -0.57530772 -1.527893514  0.722212692
## France                                    -2.19885491  0.696790265 -0.217303004
## Iraq                                       0.96258010 -1.699941581  2.252525442
## Italy                                     -2.18260449  0.338893545  0.646989575
## Canada                                    -2.26302904  1.033411190 -0.292957486
## Ethiopia                                   2.92275146  0.800047424  0.272497038
## Benin                                      2.85453769 -0.106611646 -0.837692019
## Ireland                                   -2.62808538  0.362020337 -0.640755797
## United Kingdom                            -2.05713992  0.589584266 -0.407850345
## Spain                                     -2.33209334  1.348636610  0.700913469
## Mali                                       4.42864302 -0.725923093 -0.189211918
## Cambodia                                   1.49861802  0.434332761 -0.135571575
## Finland                                   -2.48617797  1.892468888  0.594038932
## Japan                                     -2.20169826 -0.609972749 -0.751252991
## Malta                                     -1.44928760 -1.028809545 -0.081348924
## Senegal                                    2.60628829  1.250222087  1.794207914
## Iceland                                   -2.74910896  2.053637043  0.419653402
## Burundi                                    2.92716442  2.104937966 -0.100589338
## Nepal                                      1.43508146  1.194786495  0.463613461
## Malawi                                     3.49174202  1.207150760 -0.936231767
## Cameroon                                   3.50119915  1.065163090  0.063364397
## New Zealand                               -2.49492486  1.269102839  0.014971514
## Greece                                    -1.99151909  0.052754647 -0.028068250
## Bahrain                                   -1.83627456 -1.277276062 -0.005796514
## Israel                                    -2.08527706  0.550448154 -0.402495966
## Bosnia and Herzegovina                    -0.49234291 -0.706061064  0.646486710
## Cyprus                                    -1.52126616 -0.301457929 -0.810134452
## Gambia                                     3.41276063 -0.007016863 -0.901829512
## Tunisia                                   -0.84854176 -0.830766824  1.960056511
## Papua New Guinea                           2.24414173 -0.288614274 -1.445469022
## Zambia                                     2.37579379  0.275956183 -0.731488256
## Morocco                                    0.36411731 -2.047307836  0.794721315
## Egypt                                      0.07863364 -2.585345419  0.217108451
## Slovenia                                  -2.16450489  0.836704528 -0.077310885
## Tanzania (United Republic of)              2.88648521  1.899305596  0.487200420
## Mauritius                                 -0.79068829 -0.937216376 -0.221835304
## Sudan                                      2.60420748 -1.045381397  1.334927961
## Haiti                                      2.36064900 -0.561384801 -1.290302990
## The former Yugoslav Republic of Macedonia -0.62281995  0.263132896  1.292063415
## Malaysia                                  -0.92235926 -0.984616060 -0.085310083
## Poland                                    -1.51988207  0.193167920 -0.111253539
## Czech Republic                            -1.94024133  0.041663314 -0.421174852
## Algeria                                   -0.88489346 -1.630270510  1.842073337
## Lithuania                                 -1.47424792  0.607825365 -0.365058067
## Ghana                                      2.01345921  0.355485039 -1.241712687
## Portugal                                  -1.99792574  1.040595945  0.180046605
## Trinidad and Tobago                       -0.43657185  0.131125955  0.074004863
## Argentina                                 -1.45601757  0.838984586  1.037680344
## Slovakia                                  -1.48297697 -0.019596892 -0.401935798
## Estonia                                   -1.68580633  0.371211438 -0.569235608
## Croatia                                   -1.30589585  0.406709990  0.159833574
## Indonesia                                  0.48036817 -0.567596783  0.116329285
## Iran (Islamic Republic of)                -0.96436342 -2.793788341  0.315434603
## Serbia                                    -0.83377077  0.705073607  0.951750046
## Chile                                     -1.22310783 -0.447088566 -0.163992357
## Hungary                                   -1.36809208 -0.493681621 -0.942204385
## Maldives                                  -0.67904281 -0.989578003 -0.802207742
## China                                     -0.67672686  0.359953468  0.049521929
## Lebanon                                   -1.23268082 -2.526681094  0.043940503
## Russian Federation                        -0.80811646 -0.054778581 -0.829752646
## Syrian Arab Republic                       0.36377295 -2.559738024  1.335263924
## Jordan                                    -0.57591739 -2.362089025  0.910026604
## Romania                                   -0.77188498 -0.403249223 -0.702527677
## Montenegro                                -1.11686706 -0.122488486 -0.280938313
## Latvia                                    -1.31130851  0.187341613 -0.642976519
## Suriname                                  -0.12432505 -0.997314985 -0.287838649
## Uganda                                     2.95628470  1.805866347  0.326356280
## Kazakhstan                                -0.75696807  0.549022028 -0.685887511
## Mexico                                    -0.53292222  0.345557424  1.529477321
## Zimbabwe                                   2.24835708  1.728001429  0.178907487
## Peru                                      -0.08991745  0.379961425 -0.057040060
## Thailand                                  -0.32856107 -0.610848412 -1.075159434
## Botswana                                   0.51070846 -0.005038506 -1.402977382
## Kenya                                      2.23833543  0.681110048 -0.446481460
## Belarus                                   -1.09207196  0.821722632  0.249889580
## Congo                                      2.50890521  0.529214816 -1.324187811
## Bolivia (Plurinational State of)           0.61863947  2.118079004  1.809884477
## Swaziland                                  2.33315345 -0.596949105 -0.306431397
## Paraguay                                   0.44428803 -0.500263835  0.076871476
## Bahamas                                   -0.80193053  0.256043765 -0.862499787
## El Salvador                                0.31502973 -0.082434305  0.932630500
## Bulgaria                                  -0.83448409  0.297191795 -0.354609758
## Sri Lanka                                 -0.75648305 -1.887423148 -0.154346306
## Azerbaijan                                -0.13345597  0.270223921 -0.917559403
## Bangladesh                                 1.17163475 -0.264831864  0.284988061
## Uruguay                                   -1.10762421 -0.407424570 -0.791498107
## Albania                                   -0.67704042 -0.334378287  0.233899467
## Viet Nam                                  -0.12772727  0.635409753 -0.023983110
## Panama                                    -0.52104983 -0.445966480  0.109079216
## Costa Rica                                -0.88088858  0.247783224  1.177525013
## South Africa                               0.52541319  1.290788972  0.929186725
## Brazil                                    -0.52713454 -0.437211056 -0.914408365
## Cuba                                      -0.85690790  1.206502306  2.170553442
## Barbados                                  -0.83161287  0.528012907 -0.591747431
## Guatemala                                  0.92978002 -1.021873670  0.076300346
## Ukraine                                   -0.61256414 -0.236511138 -0.828004404
## Colombia                                  -0.24456731 -0.012229219 -0.017543050
## Fiji                                      -0.54064402 -1.018100867  0.043783998
## Venezuela (Bolivarian Republic of)        -0.35614736 -0.328185500 -0.262447337
## Mongolia                                  -0.58002389  0.031790917 -0.819700835
## Ecuador                                   -0.48514379  1.062937942  1.432626667
## Tonga                                     -0.45134514 -1.165949078 -1.371247180
## Georgia                                   -0.38211425 -0.445864707 -0.649432206
## Namibia                                    0.58896079  1.429030129  0.561863707
## Dominican Republic                         0.15665142 -0.232069047 -0.005138172
## Rwanda                                     1.08474351  3.125519996  1.399402307
## Bhutan                                     0.21621916 -0.205695481 -1.274792504
## Armenia                                   -0.41497907 -0.613047804 -0.668870242
## Belize                                     0.03821611 -0.834020935 -0.164341793
## Philippines                                0.25460237  0.009222189  0.528877460
## Kyrgyzstan                                -0.09841728 -0.010254730  0.233419334
## Moldova (Republic of)                     -0.14463141  0.316328188 -0.310093407
## Samoa                                     -0.49340583 -2.067158649 -0.048787759
## Jamaica                                   -0.06175153  0.015692724 -0.477625887
## Nicaragua                                  0.33842550  0.570944859  1.600126090
## Honduras                                   0.37144496 -0.477332449  0.893880234
## Lesotho                                    2.41352165  0.966326284 -0.469152353
## Tajikistan                                 0.21994425 -0.299649920 -0.471229436
## Libya                                     -1.53980422 -1.493748457  0.207781921
## Guyana                                     0.81614017 -0.005126477  0.855888106
## Gabon                                      0.38951619  0.539298671 -1.477157917
## Myanmar                                    0.22129643 -0.339858349 -2.137967976
##                                                    PC4          PC5
## Qatar                                     -3.434350558 -2.744086530
## Kuwait                                    -2.268698795 -1.459482728
## Singapore                                 -2.145302064 -0.773851494
## Chad                                      -1.710289949 -0.313209384
## Saudi Arabia                              -1.483999840 -0.771322474
## Norway                                    -1.429085734 -0.634591949
## Afghanistan                               -1.307267500  0.171076164
## Luxembourg                                -1.167750947 -0.659139954
## Switzerland                               -1.166378061 -0.293310010
## Sierra Leone                              -1.136109156 -0.260135837
## Central African Republic                  -1.039430441 -0.027250931
## United States                             -0.972954474 -0.523731016
## Niger                                     -0.965501357 -1.303263726
## Yemen                                     -0.949498266  1.002500642
## Mozambique                                -0.926674225  0.313358992
## Congo (Democratic Republic of the)        -0.866980658  0.149837925
## Liberia                                   -0.821010099  0.337292077
## Burkina Faso                              -0.808281584  0.585365213
## Côte d'Ivoire                             -0.803721453 -0.656805429
## Netherlands                               -0.797993542 -0.024473662
## Oman                                      -0.795461293 -0.057279210
## United Arab Emirates                      -0.773527981 -1.767659249
## Mauritania                                -0.732229396  0.149028861
## Korea (Republic of)                       -0.691151610  0.615032289
## Denmark                                   -0.683471922  0.056081866
## Belgium                                   -0.676265477 -0.050602376
## Sweden                                    -0.672261525 -0.206796715
## Australia                                 -0.667575221  0.146603167
## Germany                                   -0.648265885 -0.140054500
## Togo                                      -0.644239048  0.928867228
## Pakistan                                  -0.633197073  0.676731900
## India                                     -0.628896340  0.852767619
## Austria                                   -0.615634290 -0.107843670
## Turkey                                    -0.590951238  0.618018515
## France                                    -0.570782466  0.293158744
## Iraq                                      -0.565370695 -0.287911201
## Italy                                     -0.553844809  0.335992609
## Canada                                    -0.536756869 -0.054260003
## Ethiopia                                  -0.522758647  0.638178331
## Benin                                     -0.510128117  0.862388776
## Ireland                                   -0.474385756  0.158551905
## United Kingdom                            -0.430565377 -0.121151666
## Spain                                     -0.408851517  0.350774791
## Mali                                      -0.407128283 -1.027734140
## Cambodia                                  -0.403276572  1.545495585
## Finland                                   -0.402199400 -0.035148222
## Japan                                     -0.370785436  0.142433142
## Malta                                     -0.354324825  0.284860129
## Senegal                                   -0.333524119  0.042331771
## Iceland                                   -0.327005382  0.264383077
## Burundi                                   -0.323059706  0.419879197
## Nepal                                     -0.316593826  1.114794397
## Malawi                                    -0.308982421  0.054072818
## Cameroon                                  -0.260759929 -0.555844104
## New Zealand                               -0.241859264  0.196214374
## Greece                                    -0.224756734  0.689741659
## Bahrain                                   -0.214807455 -0.784249226
## Israel                                    -0.191614837  0.468282001
## Bosnia and Herzegovina                    -0.182925626  1.209146178
## Cyprus                                    -0.181589001  0.539000803
## Gambia                                    -0.172392139  0.044491812
## Tunisia                                   -0.167303534  0.654406033
## Papua New Guinea                          -0.125337620  1.185766443
## Zambia                                    -0.098419324  0.125849595
## Morocco                                   -0.091930219  0.468826463
## Egypt                                     -0.084109671  0.384246559
## Slovenia                                  -0.070286122  0.519533942
## Tanzania (United Republic of)             -0.057669838 -0.228794202
## Mauritius                                 -0.053895731  0.417082038
## Sudan                                     -0.030656873 -0.712459363
## Haiti                                     -0.018017361  0.819230616
## The former Yugoslav Republic of Macedonia -0.011789161  0.757393781
## Malaysia                                  -0.007775677  0.250309695
## Poland                                     0.028293784  0.478839786
## Czech Republic                             0.034533898  0.396983608
## Algeria                                    0.036027293 -0.035468371
## Lithuania                                  0.038625393  0.469887414
## Ghana                                      0.053762696  0.570568434
## Portugal                                   0.070353716  0.343488087
## Trinidad and Tobago                        0.094363327 -0.478807410
## Argentina                                  0.100541901 -0.382426961
## Slovakia                                   0.108807736  0.184357318
## Estonia                                    0.125625669  0.348079365
## Croatia                                    0.137752539  0.539931997
## Indonesia                                  0.140336726  0.123615819
## Iran (Islamic Republic of)                 0.145504912 -0.038590210
## Serbia                                     0.150877958  0.741356169
## Chile                                      0.151604505 -0.029232073
## Hungary                                    0.190542428  0.422324617
## Maldives                                   0.204993662  1.087492380
## China                                      0.207098455  0.948090559
## Lebanon                                    0.219929309  0.207450407
## Russian Federation                         0.230843800  0.198560713
## Syrian Arab Republic                       0.236685678  0.186780574
## Jordan                                     0.239258300 -0.030840898
## Romania                                    0.270762447  0.358282124
## Montenegro                                 0.271709792  0.787994084
## Latvia                                     0.278702422  0.357077524
## Suriname                                   0.298543025 -0.101737220
## Uganda                                     0.305231215 -0.547035571
## Kazakhstan                                 0.308198625  0.224421990
## Mexico                                     0.313139790 -0.529001350
## Zimbabwe                                   0.329045089 -0.135099353
## Peru                                       0.346459243  0.357781868
## Thailand                                   0.347688665  0.621808637
## Botswana                                   0.355643457  0.015774383
## Kenya                                      0.390264593 -0.349458967
## Belarus                                    0.394723997  0.361435870
## Congo                                      0.398128908 -0.669796656
## Bolivia (Plurinational State of)           0.399174576 -0.250247599
## Swaziland                                  0.453557485 -0.664998668
## Paraguay                                   0.471807391  0.004514533
## Bahamas                                    0.488197758  0.057513980
## El Salvador                                0.498582285 -0.209535347
## Bulgaria                                   0.529608478  0.276207632
## Sri Lanka                                  0.538663748  0.378862799
## Azerbaijan                                 0.560060281  0.186188142
## Bangladesh                                 0.581364674 -0.152520365
## Uruguay                                    0.599660130 -0.231245152
## Albania                                    0.600494707  0.497187324
## Viet Nam                                   0.603302121  0.872257041
## Panama                                     0.606356209 -0.882843754
## Costa Rica                                 0.612051066 -0.448031363
## South Africa                               0.625719759 -0.626998851
## Brazil                                     0.635546781 -0.242856529
## Cuba                                       0.652802978 -0.141187770
## Barbados                                   0.716841534  0.216873945
## Guatemala                                  0.732023267 -0.737883028
## Ukraine                                    0.735815695  0.683855615
## Colombia                                   0.739148454 -0.404110156
## Fiji                                       0.751446198  0.004254590
## Venezuela (Bolivarian Republic of)         0.758276118 -0.904889663
## Mongolia                                   0.758449744  0.440072188
## Ecuador                                    0.768903920 -0.699510243
## Tonga                                      0.781942289  0.749978939
## Georgia                                    0.800279490  0.377801426
## Namibia                                    0.814718726 -0.404258828
## Dominican Republic                         0.820699804 -0.898427787
## Rwanda                                     0.823466529 -0.094488185
## Bhutan                                     0.843768126  0.344669908
## Armenia                                    0.845581564  0.440952425
## Belize                                     0.905689903 -0.354976024
## Philippines                                0.943707745 -0.508529110
## Kyrgyzstan                                 0.984423355  0.253963473
## Moldova (Republic of)                      0.994627115  0.470592772
## Samoa                                      1.016812799 -0.110042154
## Jamaica                                    1.044404927 -0.227410096
## Nicaragua                                  1.100369752 -1.082146297
## Honduras                                   1.206509214 -1.014293478
## Lesotho                                    1.226549015 -1.630682971
## Tajikistan                                 1.280122465  0.062883028
## Libya                                      1.311190236 -0.777481767
## Guyana                                     1.490657114 -1.914978588
## Gabon                                      1.778909476 -2.181779590
## Myanmar                                    2.275737391 -0.784101836
##                                                     PC6          PC7
## Qatar                                      0.7794028704 -0.869044046
## Kuwait                                     0.4433858104 -0.154579254
## Singapore                                  0.2087344376 -0.631938288
## Chad                                      -0.1384298783  0.505501713
## Saudi Arabia                               0.2995446875  0.814349653
## Norway                                     0.1057068050 -0.021435945
## Afghanistan                                0.1650292048  0.419870531
## Luxembourg                                 0.4269134315 -0.843634606
## Switzerland                                0.1865273520 -0.420485693
## Sierra Leone                               0.9362103161  1.063217796
## Central African Republic                   0.9515520853  0.428317282
## United States                             -0.3451427925 -0.002143577
## Niger                                     -1.4900241152 -0.772778138
## Yemen                                      0.3877720692 -0.105861134
## Mozambique                                -0.4315443547 -0.191228703
## Congo (Democratic Republic of the)        -0.6733926462  0.284824245
## Liberia                                   -0.4465735263  0.044506003
## Burkina Faso                              -0.3017379532 -0.688616909
## Côte d'Ivoire                              0.0332611178  0.771694885
## Netherlands                               -0.0840485953  0.299683270
## Oman                                       0.2559958563 -0.052902782
## United Arab Emirates                       0.4626591783 -0.456221585
## Mauritania                                 0.3835135567 -0.206485725
## Korea (Republic of)                       -0.2412194847  0.254405229
## Denmark                                   -0.0930002006  0.544652022
## Belgium                                    0.1848918584 -0.041147418
## Sweden                                     0.2506794343 -0.372812563
## Australia                                 -0.6199683954  0.917438777
## Germany                                    0.1990240456 -0.001114540
## Togo                                      -0.3551166163  0.469693499
## Pakistan                                   1.0547918604 -0.663717157
## India                                      0.2976255773  0.295702180
## Austria                                    0.2005291571 -0.252609069
## Turkey                                    -0.3388083756  0.259713548
## France                                    -0.0424944013 -0.194411707
## Iraq                                      -0.0205434516 -0.377293351
## Italy                                     -0.0518444082 -0.045035198
## Canada                                    -0.0899079143 -0.292960474
## Ethiopia                                   0.2799208080 -0.661627350
## Benin                                     -0.3370880163  0.133803308
## Ireland                                   -0.3678977300  0.633980162
## United Kingdom                            -0.3405838816 -0.083078182
## Spain                                     -0.2384440630  0.193672897
## Mali                                      -1.1754044432 -0.019086504
## Cambodia                                   0.0893450889 -0.641196445
## Finland                                    0.0500578223  0.125708116
## Japan                                     -0.1092196560 -0.295650382
## Malta                                     -0.1902293297 -0.234558008
## Senegal                                    0.1229101109 -0.989313955
## Iceland                                   -0.3893837420  0.527074951
## Burundi                                    1.8276366746  0.677851002
## Nepal                                     -0.5593956686 -0.297819695
## Malawi                                    -1.1968378168 -0.088176314
## Cameroon                                   0.0737457183  0.614653903
## New Zealand                               -0.7465178815  0.702424299
## Greece                                    -0.4677522487  0.508801969
## Bahrain                                    0.3793530928  0.116840333
## Israel                                    -0.1633957692 -0.182073253
## Bosnia and Herzegovina                    -0.0270242943 -0.181398112
## Cyprus                                     0.1192318589 -0.540291494
## Gambia                                    -0.3474528468 -0.395655997
## Tunisia                                    0.3893479627  0.555015084
## Papua New Guinea                           0.0929917875 -0.590106926
## Zambia                                    -1.2011155278  0.645660843
## Morocco                                    0.0385729540 -0.140453926
## Egypt                                     -0.3163885313  0.410135541
## Slovenia                                   0.0215875237  0.200046218
## Tanzania (United Republic of)             -0.4568728938 -0.686533066
## Mauritius                                 -0.3224620116  0.555744456
## Sudan                                      0.6102373647 -0.583319974
## Haiti                                      0.8528400232 -0.487249059
## The former Yugoslav Republic of Macedonia  0.2271429608 -0.225518600
## Malaysia                                   0.5943478473 -0.310882005
## Poland                                     0.0146512625  0.078155353
## Czech Republic                             0.0054545873  0.274517961
## Algeria                                    0.4902133796  0.630600521
## Lithuania                                  0.1583897022  0.463060174
## Ghana                                      0.3721749488  0.232143202
## Portugal                                  -0.1179107107  0.036420328
## Trinidad and Tobago                        0.5483599732 -0.287201887
## Argentina                                 -0.7562700661  0.998335318
## Slovakia                                   0.0857885836  0.030612783
## Estonia                                   -0.1433648078  0.319266420
## Croatia                                    0.1183552755 -0.081785909
## Indonesia                                  0.0697247267  0.318509670
## Iran (Islamic Republic of)                -0.3591669622  0.722852945
## Serbia                                     0.1960956378  0.001140764
## Chile                                     -0.9916455370 -0.141135886
## Hungary                                    0.0530148587  0.202738900
## Maldives                                   0.2331259089 -0.434462712
## China                                      0.3732800063 -0.461876446
## Lebanon                                   -0.0082757639  0.066977961
## Russian Federation                         0.1868210793  0.214125170
## Syrian Arab Republic                       0.0533687969  0.381438615
## Jordan                                     0.0960728327  0.426350376
## Romania                                   -0.1404008807 -0.081718774
## Montenegro                                -0.0735326488  0.115347703
## Latvia                                     0.1885997063  0.131359913
## Suriname                                   0.2671143004  0.054969268
## Uganda                                    -0.2663511002 -0.222926610
## Kazakhstan                                 0.1478071951  0.258894510
## Mexico                                    -0.3750194353 -0.284486795
## Zimbabwe                                   1.0326915615  0.484252449
## Peru                                      -0.2954092927 -0.346454974
## Thailand                                  -0.3728835623 -0.321836338
## Botswana                                   0.4208028626  0.051944196
## Kenya                                      0.0003908410  0.192825515
## Belarus                                    0.2002805861  0.442680455
## Congo                                     -0.7300608217  0.035047723
## Bolivia (Plurinational State of)          -0.0005896796  0.218008164
## Swaziland                                  0.9379950756  1.083913498
## Paraguay                                  -0.3868646873 -0.383448677
## Bahamas                                    0.2221208500 -0.684132379
## El Salvador                               -0.5208362283 -0.298642007
## Bulgaria                                  -0.1548721090 -0.094755523
## Sri Lanka                                  0.1134797992  0.181878318
## Azerbaijan                                 0.2228914450 -0.691575682
## Bangladesh                                -0.2801004193 -0.761911579
## Uruguay                                   -0.8520051813  0.123425781
## Albania                                    0.3719985522 -0.758087155
## Viet Nam                                   0.1370254776 -0.819669561
## Panama                                    -0.7794109136 -0.261632639
## Costa Rica                                -0.5801338772 -0.205490683
## South Africa                               0.8641957710  0.880893184
## Brazil                                    -0.9387997247  0.259841102
## Cuba                                      -0.1046529231 -0.129211242
## Barbados                                  -0.5554976971  0.143179329
## Guatemala                                 -0.6663882061 -0.519546181
## Ukraine                                   -0.0402324586  0.377186088
## Colombia                                  -0.5193595878 -0.089128360
## Fiji                                      -0.2586706159  0.954688210
## Venezuela (Bolivarian Republic of)        -0.8202034504  0.165151944
## Mongolia                                   0.3664312347  0.382144832
## Ecuador                                   -0.6229513015  0.060219694
## Tonga                                      0.0223647898  0.463117389
## Georgia                                   -0.4329675048 -0.129134324
## Namibia                                    0.6531061286 -0.279270660
## Dominican Republic                        -0.9977964148 -0.136457443
## Rwanda                                     1.4882291993 -0.291101527
## Bhutan                                     0.1354618087 -0.160322615
## Armenia                                    0.1512678759 -0.513028090
## Belize                                    -0.5257968357  0.220621752
## Philippines                                0.5735628979 -0.188977457
## Kyrgyzstan                                 0.4470455299 -0.082545945
## Moldova (Republic of)                      0.3856501921 -0.575902862
## Samoa                                      0.1862354772  0.197408492
## Jamaica                                   -0.5596006430 -0.552066779
## Nicaragua                                 -0.6811130040 -0.518838884
## Honduras                                  -0.3160976543 -0.377880488
## Lesotho                                    1.0758622144  1.138717611
## Tajikistan                                 0.3475875471 -0.503606390
## Libya                                      0.9330807855  0.570157750
## Guyana                                     0.3639403210 -0.027849786
## Gabon                                     -0.1202364303  0.236755731
## Myanmar                                    1.7676007033 -0.806312423
##                                                    PC8
## Qatar                                     -0.544730977
## Kuwait                                    -0.577682871
## Singapore                                 -0.058473974
## Chad                                       0.720298478
## Saudi Arabia                              -0.334517472
## Norway                                    -0.123820241
## Afghanistan                               -0.184925327
## Luxembourg                                 0.088632850
## Switzerland                                0.149931235
## Sierra Leone                               1.384425435
## Central African Republic                   0.645440709
## United States                             -0.117040628
## Niger                                      0.358842606
## Yemen                                     -0.164016287
## Mozambique                                -0.793719582
## Congo (Democratic Republic of the)         0.535636325
## Liberia                                    0.479166548
## Burkina Faso                              -0.537702988
## Côte d'Ivoire                              0.135308345
## Netherlands                                0.062106304
## Oman                                       0.014389692
## United Arab Emirates                      -0.094267148
## Mauritania                                -0.121129103
## Korea (Republic of)                        0.313970540
## Denmark                                   -0.054496911
## Belgium                                    0.045294793
## Sweden                                     0.117156987
## Australia                                  0.122261331
## Germany                                    0.085755248
## Togo                                      -0.266806482
## Pakistan                                  -0.193259634
## India                                     -0.049179210
## Austria                                    0.146278110
## Turkey                                    -0.161342462
## France                                     0.236234797
## Iraq                                      -0.466207100
## Italy                                      0.341553582
## Canada                                     0.172678430
## Ethiopia                                   0.123086514
## Benin                                     -0.550244820
## Ireland                                    0.142599173
## United Kingdom                             0.075500247
## Spain                                      0.220613119
## Mali                                      -0.051504428
## Cambodia                                  -0.343470935
## Finland                                    0.056001135
## Japan                                      0.466519718
## Malta                                      0.261913162
## Senegal                                   -0.050182834
## Iceland                                    0.162465364
## Burundi                                    0.872691209
## Nepal                                     -0.312497632
## Malawi                                     0.078962173
## Cameroon                                   0.012924475
## New Zealand                                0.134924597
## Greece                                     0.204463481
## Bahrain                                    0.116941358
## Israel                                     0.298293132
## Bosnia and Herzegovina                    -0.037480926
## Cyprus                                     0.238419359
## Gambia                                    -0.154919425
## Tunisia                                    0.085563570
## Papua New Guinea                          -0.568359399
## Zambia                                    -0.794475565
## Morocco                                    0.283122879
## Egypt                                     -0.229456272
## Slovenia                                   0.201136871
## Tanzania (United Republic of)              0.031354925
## Mauritius                                  0.021314559
## Sudan                                      0.195483201
## Haiti                                      0.202062572
## The former Yugoslav Republic of Macedonia -0.140384418
## Malaysia                                   0.025054670
## Poland                                    -0.006129686
## Czech Republic                             0.125812767
## Algeria                                    0.366520434
## Lithuania                                 -0.330632189
## Ghana                                     -0.037526847
## Portugal                                   0.229393237
## Trinidad and Tobago                       -0.252645792
## Argentina                                  0.001276173
## Slovakia                                  -0.049162831
## Estonia                                   -0.047421539
## Croatia                                    0.040691396
## Indonesia                                  0.017986999
## Iran (Islamic Republic of)                 0.148533006
## Serbia                                    -0.152580467
## Chile                                      0.323884955
## Hungary                                   -0.059973220
## Maldives                                   0.197975099
## China                                      0.025315956
## Lebanon                                    0.488700890
## Russian Federation                        -0.481743543
## Syrian Arab Republic                      -0.199383676
## Jordan                                     0.176778023
## Romania                                   -0.075240568
## Montenegro                                -0.024931977
## Latvia                                    -0.169022123
## Suriname                                   0.088444159
## Uganda                                    -0.572916878
## Kazakhstan                                -0.575145605
## Mexico                                     0.070725796
## Zimbabwe                                   0.016998870
## Peru                                       0.026102641
## Thailand                                  -0.146109071
## Botswana                                  -0.413437307
## Kenya                                      0.026234807
## Belarus                                   -0.466662216
## Congo                                     -0.017934893
## Bolivia (Plurinational State of)          -0.183113873
## Swaziland                                 -1.057558817
## Paraguay                                   0.035958347
## Bahamas                                    0.017655734
## El Salvador                               -0.142869401
## Bulgaria                                  -0.221248901
## Sri Lanka                                  0.190906050
## Azerbaijan                                -0.410686138
## Bangladesh                                 0.135731141
## Uruguay                                    0.027424769
## Albania                                    0.295251360
## Viet Nam                                   0.087508486
## Panama                                     0.311755507
## Costa Rica                                 0.303047765
## South Africa                              -1.053520373
## Brazil                                    -0.015904503
## Cuba                                       0.481755252
## Barbados                                   0.027150536
## Guatemala                                  0.082547026
## Ukraine                                   -0.298001037
## Colombia                                   0.040791381
## Fiji                                      -0.192307926
## Venezuela (Bolivarian Republic of)         0.135464359
## Mongolia                                  -0.229052691
## Ecuador                                    0.136462760
## Tonga                                      0.306781672
## Georgia                                    0.042053001
## Namibia                                   -0.508079106
## Dominican Republic                        -0.001090328
## Rwanda                                     0.123998389
## Bhutan                                    -0.074744150
## Armenia                                    0.092802384
## Belize                                    -0.319139671
## Philippines                               -0.097634576
## Kyrgyzstan                                -0.042950292
## Moldova (Republic of)                     -0.213918903
## Samoa                                      0.295420249
## Jamaica                                    0.212157687
## Nicaragua                                  0.168834839
## Honduras                                   0.264847993
## Lesotho                                   -0.302736425
## Tajikistan                                -0.216743506
## Libya                                      0.150608995
## Guyana                                     0.283691259
## Gabon                                     -0.016809690
## Myanmar                                    0.464249830

4. Interpret biplot

  • Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points)

The principal components

I already touched on this in the previous section, but as the first principal components align with Mat.Mor, Ado.Birth, Life.Exp, GNI, Edu.Exp, and Edu2.FM, it seems to capture the general wellbeing of society. Or rather, the lack thereof, since it’s pointing in the other direction.

The second axis measures gender equality, as it aligns well with Parli.F and Labo.FM.

Social equality trend line

It is good to point out at this point that there is a clear trend line in the upper left corner of the biplot, moving from the right bottom to the top left. It seems like most of the countries are in this trend line. There is then a large spread of the rest of the countries, which is why this trend is not aligned with the principal components. Let’s call this trend line the social equality trend line, as all the countries at the top of this trend line have social welfare programs, such as universal healthcare, strong unions, and low GINI indices. (A bit hand-wavy, but this is a course assignment, not a research paper).

Correlations

There are some additional interesting correlations in the original variables with the PCs. Mat.Mor and Edu.Exp both point slightly in the positive PC2 direction. This doesn’t make a lot of intuitive sense if we consider PC2 to capture gender equality or liberal values. However, Mat.Mor is pointing in the opposite direction from the social equality trend line, which may suggest that the principal components aren’t aligned in any particularily meaningful direction. PCA has maximiced the dispersion along PC1 and PC2, but it does not necessarily mean that the axes are meaningful in themselves, the axes may be slightly misaligned compared to the labels I’ve given them.

This idea is supported by the fact that PC1 is bimodal, so the second, smaller mode might then be pulling the PCA slightly off the trend line. The other mode consists mostly of African countries and other developing nations. The fact that the other variables are distributed differently here might be due to artifacts of colonialism or some other big systemic differences between developed and developing nations.

ggplot(data.frame(pca_scaled$x), aes(x=PC1)) + geom_density()
fig. 3.2, PC1 distribution, clearly bimodal

fig. 3.2, PC1 distribution, clearly bimodal

Parli.F and Labo.FM are pointing in slightly different directions from each other, with Parli.FM slightly aligned with GNI, and Labo.FM completely orthogonal to it. This suggests that female labor participation has no effect on GNI (it doesn’t matter for the economy what the sex of a worker is), but female leadership has a positive correlation with GNI (of course, we don’t have a causal link here, only a correlation).

EXTRA: redoing the PCA with the log-transformed data

Just for fun, let’s use the log transformed data to see if some of those correlations change

pca_log_scaled <- human.log %>% scale %>% as.data.frame %>% prcomp
summary(pca_log_scaled)
biplot(pca_log_scaled, cex = c(0.8, 1), col = c("grey40", "#dd4028"))
fig. 4.1, biplot for PCA of the standardized and log transformed data

fig. 4.1, biplot for PCA of the standardized and log transformed data

Sure enough, if we look at the summary, we see that the first PC now captures 57% of the variance compared to 53.6% before, a better result. The first four PCs now capture 90% of the variance, after that we get diminishing returns.

pca_log_scaled
## Standard deviations (1, .., p=8):
## [1] 2.1505310 1.1259172 0.8681023 0.7727862 0.5558253 0.4410956 0.3827107
## [8] 0.3267292
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4         PC5
## Edu2.FM   -0.32850815  0.01067147 -0.35894548  0.77560185 -0.36060516
## Labo.FM    0.03263713  0.72942195 -0.61074433 -0.23458714  0.05083415
## Life.Exp  -0.42542035 -0.05622607  0.11141581 -0.03806539  0.26227538
## Edu.Exp   -0.42035073  0.09697456 -0.06818573 -0.03839701  0.44933876
## GNI       -0.43314660 -0.09480875 -0.01735880  0.06836804  0.13860792
## Mat.Mor    0.43672168  0.01162213 -0.02391875  0.21779671 -0.12263401
## Ado.Birth  0.38284906  0.03041796 -0.03340255  0.48515646  0.74256569
## Parli.F   -0.09178670  0.66724454  0.69216873  0.23021937 -0.10502880
##                   PC6         PC7         PC8
## Edu2.FM    0.05763161 -0.12121944 -0.11626173
## Labo.FM    0.12840535  0.11338286  0.08313215
## Life.Exp   0.73375796 -0.22248569  0.38118853
## Edu.Exp   -0.56661984 -0.53272731  0.03187199
## GNI       -0.27117249  0.74980849  0.37876162
## Mat.Mor   -0.17790440 -0.22051498  0.81597528
## Ado.Birth  0.11877107  0.16353812 -0.15412247
## Parli.F   -0.03795809  0.03959254 -0.01489879

Looking closer at the coefficients, PC1, PC2, and PC3 have roughly the same interpretation as before. PC4 has changed now, but it is still as difficult to interpret as before.

5. MCA on tea data

The tea data comes from the FactoMineR package and it is measured with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what are their product’s perception (12 questions). In addition, some personal details were asked (4 questions).

Load the tea dataset and convert its character variables to factors:

tea <- read.csv(“https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv”, stringsAsFactors = TRUE)

  • Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data.
  • Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!).
  • Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA.
  • Comment on the output of the plots. (0-4 points)
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
dim(tea)
## [1] 300  36
#View(tea)

The dataset seems to consist of 300 rows of 36 answers to questionnaire, each row represents one questionnaire.

Summary statistics follow.

summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   +60  :38   +2/day     :127  
##  middle      :40   sportsman    :179   15-24:92   1 to 2/week: 44  
##  non-worker  :64                       25-34:69   1/day      : 95  
##  other worker:20                       35-44:40   3 to 6/week: 34  
##  senior      :35                       45-59:61                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 

Factominer documentation (https://rdrr.io/cran/FactoMineR/man/tea.html) says the first 18 vars are about how they drink tea, 19 is age, and the rest are personal questions and “product’s perception” which I’m choosing to interpret as how they think about tea.

5.1 MCA

There are a lot of variables here, so let’s choose a sensible subset of them to look at. Let’s choose the following variables, which I’m interpreting free-hand here, since I can’t find much metadata:

  • Tea: what kind of tea out of three types the respondent prefers (green, black, Earl Grey)
  • price: the price level of the tea that the respondent prefers
  • How: whether the respondent takes tea as is, with lemon, with milk, or in some other way
  • sex: the sex of the respondent
  • SPC: some kind of general social group for the respondent (student, employee (white collar?), middle (management?), senior)
  • age_Q: the age group of the respondent
  • frequency: how often the respondent drinks tea

of these, Tea, price, and How are “active” variables, and the rest are “supplementary” variables.

As a sidenote, this dataset is very badly documented.

tea_filtered <- tea %>% dplyr::select(one_of("Tea", "price", "How", "sex", "SPC", "age_Q", "frequency"))
summary(tea_filtered)
##         Tea                  price        How      sex               SPC    
##  black    : 74   p_branded      : 95   alone:195   F:178   employee    :59  
##  Earl Grey:193   p_cheap        :  7   lemon: 33   M:122   middle      :40  
##  green    : 33   p_private label: 21   milk : 63           non-worker  :64  
##                  p_unknown      : 12   other:  9           other worker:20  
##                  p_upscale      : 53                       senior      :35  
##                  p_variable     :112                       student     :70  
##                                                            workman     :12  
##    age_Q          frequency  
##  +60  :38   +2/day     :127  
##  15-24:92   1 to 2/week: 44  
##  25-34:69   1/day      : 95  
##  35-44:40   3 to 6/week: 34  
##  45-59:61                    
##                              
## 

Let’s look at a summary of the MCA of the dataset based on these specs above.

mca=MCA(tea_filtered,quali.sup=4:7 ,graph=F)
summary(mca)
## 
## Call:
## MCA(X = tea_filtered, quali.sup = 4:7, graph = F) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.423   0.399   0.383   0.355   0.335   0.330   0.316
## % of var.             12.688  11.967  11.497  10.653  10.043   9.907   9.484
## Cumulative % of var.  12.688  24.655  36.152  46.805  56.848  66.755  76.239
##                        Dim.8   Dim.9  Dim.10
## Variance               0.283   0.271   0.238
## % of var.              8.490   8.142   7.129
## Cumulative % of var.  84.729  92.871 100.000
## 
## Individuals (the 10 first)
##                    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1               |  0.720  0.409  0.056 |  1.341  1.503  0.196 |  0.789  0.541
## 2               |  0.782  0.481  0.216 |  0.593  0.294  0.124 |  0.071  0.004
## 3               | -0.462  0.168  0.231 |  0.110  0.010  0.013 | -0.707  0.435
## 4               | -0.462  0.168  0.231 |  0.110  0.010  0.013 | -0.707  0.435
## 5               | -0.462  0.168  0.231 |  0.110  0.010  0.013 | -0.707  0.435
## 6               | -0.357  0.100  0.027 | -0.078  0.005  0.001 |  0.344  0.103
## 7               | -0.462  0.168  0.231 |  0.110  0.010  0.013 | -0.707  0.435
## 8               |  0.782  0.481  0.216 |  0.593  0.294  0.124 |  0.071  0.004
## 9               |  0.497  0.195  0.083 |  0.222  0.041  0.016 |  0.544  0.257
## 10              |  1.104  0.961  0.443 | -0.678  0.384  0.167 |  0.138  0.016
##                   cos2  
## 1                0.068 |
## 2                0.002 |
## 3                0.542 |
## 4                0.542 |
## 5                0.542 |
## 6                0.025 |
## 7                0.542 |
## 8                0.002 |
## 9                0.099 |
## 10               0.007 |
## 
## Categories (the 10 first)
##                     Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black           |   1.366  36.278   0.611  13.516 |  -0.083   0.143   0.002
## Earl Grey       |  -0.440   9.794   0.348 -10.207 |   0.312   5.225   0.175
## green           |  -0.493   2.106   0.030  -2.996 |  -1.636  24.612   0.331
## p_branded       |  -0.497   6.176   0.115  -5.856 |  -0.218   1.257   0.022
## p_cheap         |   1.116   2.289   0.030   2.982 |   1.313   3.361   0.041
## p_private label |   0.018   0.002   0.000   0.084 |  -0.132   0.102   0.001
## p_unknown       |   0.314   0.310   0.004   1.108 |   2.953  29.139   0.363
## p_upscale       |   1.063  15.724   0.242   8.512 |  -0.874  11.272   0.164
## p_variable      |  -0.188   1.036   0.021  -2.504 |   0.225   1.575   0.030
## alone           |  -0.275   3.870   0.140  -6.477 |  -0.328   5.831   0.199
##                  v.test     Dim.3     ctr    cos2  v.test  
## black            -0.825 |   0.099   0.209   0.003   0.977 |
## Earl Grey         7.240 |  -0.247   3.420   0.110  -5.741 |
## green            -9.947 |   1.224  14.341   0.185   7.443 |
## p_branded        -2.566 |   0.459   5.803   0.098   5.403 |
## p_cheap           3.509 |   0.315   0.201   0.002   0.841 |
## p_private label  -0.626 |   1.039   6.569   0.081   4.928 |
## p_unknown        10.421 |   1.519   8.030   0.096   5.362 |
## p_upscale        -6.999 |   0.310   1.476   0.021   2.483 |
## p_variable        2.999 |  -0.913  27.080   0.497 -12.188 |
## alone            -7.721 |  -0.153   1.330   0.044  -3.614 |
## 
## Categorical variables (eta2)
##                   Dim.1 Dim.2 Dim.3  
## Tea             | 0.611 0.359 0.207 |
## price           | 0.324 0.559 0.565 |
## How             | 0.334 0.279 0.378 |
## 
## Supplementary categories (the 10 first)
##                    Dim.1   cos2 v.test    Dim.2   cos2 v.test    Dim.3   cos2
## F               | -0.099  0.014 -2.071 | -0.001  0.000 -0.018 | -0.092  0.012
## M               |  0.145  0.014  2.071 |  0.001  0.000  0.018 |  0.135  0.012
## employee        | -0.153  0.006 -1.307 | -0.104  0.003 -0.891 |  0.027  0.000
## middle          |  0.179  0.005  1.213 |  0.095  0.001  0.645 |  0.033  0.000
## non-worker      |  0.281  0.021  2.531 | -0.214  0.012 -1.927 | -0.010  0.000
## other worker    |  0.221  0.003  1.021 |  0.174  0.002  0.804 |  0.108  0.001
## senior          |  0.226  0.007  1.419 |  0.086  0.001  0.543 |  0.009  0.000
## student         | -0.350  0.037 -3.343 |  0.108  0.004  1.027 | -0.090  0.002
## workman         | -0.327  0.004 -1.154 |  0.167  0.001  0.588 |  0.133  0.001
## +60             |  0.702  0.072  4.624 | -0.316  0.014 -2.078 | -0.033  0.000
##                 v.test  
## F               -1.930 |
## M                1.930 |
## employee         0.229 |
## middle           0.225 |
## non-worker      -0.091 |
## other worker     0.498 |
## senior           0.056 |
## student         -0.861 |
## workman          0.470 |
## +60             -0.218 |
## 
## Supplementary categorical variables (eta2)
##                   Dim.1 Dim.2 Dim.3  
## sex             | 0.014 0.000 0.012 |
## SPC             | 0.068 0.020 0.004 |
## age_Q           | 0.130 0.022 0.014 |
## frequency       | 0.011 0.007 0.026 |

5.2

Let’s start with a plot of the individuals (ind):

par(mfrow=c(2,2))
plot(mca, graph.type="ggplot", invisible=c("var","quali.sup","quanti.sup"),cex=0.8)

I see no real clusters, but there seem to be some duplicates. Dimension 1 covers 12.6% of the variance, and dimension 2 covers 12% of the variance, roughly the same. The scatter plot forms a triangle in this space, with a wide base forming around the Dim 1 = 0 and extending to the right.

Now let’s plot the var variables — Tea, price, and How — which the plot function for FactoMineR’s MCA colors black, red, and green, respectively.

plot(mca, graph.type="ggplot", invisible=c("ind","quali.sup","quanti.sup"), cex=0.8, habillage="quali")

Dimension 1 seems to cover: - most strongly (Tea: 0.61), whether the respondent prefers black tea or not (black +1.36, green -0.49, Earl Grey -0.44) - then (How: 0.334), whether the respondenr likes to add things to their tea (alone -0.28, lemon, milk, other all positive +) - and finally, no clear correlation with price, since both upscale and cheap are positive in dimension 1

Dimension 2 seems to cover: - whether the respondent is unlikely to prefer green tea (-1.636) - whether the respondent likes cheap tea (cheap +1.313, upscale -0.874)

The p_unknown category in the price has the highest coefficient here, which makes this one tough to interpret.

Finally, let’s look at the demographics of the individuals and see how they distribute over this space:

plot(mca, graph.type="ggplot", invisible=c("ind","var","quanti.sup"), cex=0.8, habillage="quali")

It seems that age is the best explaining variable for the differences in respondents tea drinking habits (dim 1: 0.130, dim 2: 0.022), followed by SPC, which does encode some of the same information as age, so perhaps it is redundant. If we interpret the dimensions from before we could roughly say that: - older people are more likely to prefer generic black tea to green tea or Earl Grey - younger people are more likely to prefer Earl Grey - green tea is most likely to be preferred by people in the 25-34 age range who are “employees” (office workers?) - older people prefer to add milk, lemon, or something else to their tea

There is also a slight difference in the distribution of answers between sexes: - men more likely to prefer black tea - women more likely to prefer Ear Grey or green tea

The frequency of tea drinking is all over the place and forms no clear trendline in the biplot.